Tuesday, April 15, 2014

Feature extraction and normalization

I started doing some experiments on i-vector extraction. Since I will probably implement this extraction myself, it is essential for me to have comparable results.

To understand how this procedure works, I illustrated it in a graph.
Of course, to begin with speaker recognition and everything that comes with it, it is necessary to get data.
Where do you get data from? Well, the usual corpora provide most of the data I used in my experiments. Mostly I consider the NIST corpora and Switchboard a good source of already processed data.

This data comes mostly in the form of speech waveform files, either '.sph' or '.wav'.

These files are the raw data, i.e. the audio recorded with a microphone or over a telephone ( Switchboard and NIST use different recording methods ).

These files are then analyzed with HCopy, a program from the HTK toolkit. HCopy can be used for many applications; my application is the extraction of features from the raw speech.
To achieve this, different extraction techniques can be used, namely MFCC ( Mel-Frequency Cepstral Coefficients ) and PLP ( Perceptual Linear Prediction ). Both use different frame sizes and filter banks.
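For illustration, a minimal HCopy setup could look like the sketch below. The file names and the exact parameter values are just assumptions for this example; the configuration keys themselves are the standard ones from the HTK book.

    # conf/plp.cfg -- hypothetical file name, values are only illustrative
    # NIST sphere input ( use WAV for .wav files ), PLP + C0 with deltas
    # and accelerations, 25 ms window, 10 ms shift ( HTK counts in 100 ns units )
    SOURCEFORMAT = NIST
    TARGETKIND   = PLP_0_D_A
    TARGETRATE   = 100000.0
    WINDOWSIZE   = 250000.0
    USEHAMMING   = T
    PREEMCOEF    = 0.97
    NUMCHANS     = 24
    NUMCEPS      = 12

The features would then be produced with something like 'HCopy -C conf/plp.cfg -S lists/extract.scp', where every line of the ( equally hypothetical ) .scp script file maps one source waveform to one target feature file.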

So the question is, why do we actually extract features? Why don't we just use the raw input files?

Well, the answer is quite simple: we just don't have enough resources. The estimation process and the number of parameters are gigantic ( even if we reduce the parameters via feature extraction ). Overall we aim to reduce the number of parameters, but at the same time try to maximize the amount of data we can include in our model.
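To get a feeling for the numbers: one second of 8 kHz, 16-bit telephone speech consists of 8000 raw samples, while the same second turned into, say, 39-dimensional feature vectors ( 12 cepstral coefficients plus energy, each with deltas and double deltas ) at a 10 ms frame shift gives only 100 x 39 = 3900 values, and those values are far better suited for statistical modelling than the raw samples. The exact dimensions of course depend on the chosen configuration.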

After finishing the extraction there is a problem, namely that our extracted features also include silence. When the data is recorded in a controlled environment, we can probably assume that the result won't have much background noise.
Still, there will always be silence in the recordings. This essentially doesn't hurt when calculating the cepstral log powers, since it won't affect them, yet the data size increases, because every frame, silent or not, results in an N-dimensional feature vector.
So we can reduce the number of vectors by excluding the silence, or to put it differently, by including just speech in our features.
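Just to illustrate the idea ( this is not the exact method I use ), a simple energy-based silence filter over a feature matrix could look like the following Python sketch. It assumes we have one log-energy value per frame and simply drops the frames that fall too far below the loudest frame of the utterance; the function name and the threshold are my own choices.

    import numpy as np

    def drop_silence(features, log_energy, threshold_offset=3.0):
        """Keep only frames whose log energy lies within threshold_offset
        of the maximum log energy observed in the utterance.

        features   : (num_frames, dim) array of feature vectors
        log_energy : (num_frames,) array with one log-energy value per frame
        """
        keep = log_energy > (log_energy.max() - threshold_offset)
        return features[keep]

Real systems usually use a proper speech/non-speech detector instead of a fixed energy threshold, but the effect on the feature files is the same: fewer, speech-only vectors.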


After we have removed the silence and have pure-speech feature files ( illustrated as the new .plp ), we can normalize the means and the variances of our features. This is simply a shifting and rescaling of all vector components so that the coefficients do not differ greatly from each other ( for later use ). Experiments have shown that this normalization improves the performance.
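As a minimal sketch of what this per-utterance mean and variance normalization does ( the function name and the small epsilon are my own choices ):

    import numpy as np

    def mean_variance_normalize(features, eps=1e-8):
        """Shift every coefficient to zero mean and scale it to unit variance,
        with the statistics computed over all frames of one utterance.

        features : (num_frames, dim) array of speech-only feature vectors
        """
        mean = features.mean(axis=0)
        std = features.std(axis=0)
        return (features - mean) / (std + eps)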
After the normalization of both means and variances has finished, we have our input feature vectors and can proceed to train our model.


