Wednesday, April 16, 2014

Tutorial: Training a UBM Model and Extracting I-Vectors with Alize 3.0

To train a model and get familiar with I-Vector behavior, I trained my model with the already available Alize 3.0 framework.

After compilation, Alize provides every binary needed for training and extraction.
HTK can also be downloaded if the precompiled Alize binaries are not sufficient or simply won't run on your architecture.

The config files of this tutorial are also needed, since I oriented myself strongly on that material, although I automated the whole process. I will point out the 'pain in the ass' parts of the configuration.

After downloading and compiling, create a new directory in which you would like to estimate the model (e.g. model).

Execute the following commands in the console:

mkdir lib
mkdir lib/scp
mkdir lst
mkdir data
mkdir data/prm
mkdir data/lbl


After that we need to link the Alize binaries into this directory, so just create a softlink and execute:
ln -s <your path to Alize>/bin .

cp -r {YourIVectorDir} cfg

To begin the process it is necessary to generate (if not already done) a .scp file, which maps the given raw files (either .wav or .sph) to the corresponding feature file splits (PLP, MFCC, or whatever you like).

The format of this file is, e.g.:

/dnn_data/8h-plp39-z/jaat-B_10065_10348.plp
where the last part of the filename marks the extent of the speech utterance within that speaker's full recording.
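
As a small side note, such an entry can be taken apart programmatically; here is a minimal Python sketch (the function name and the 'name_start_end' interpretation are my own assumptions based on the example above):

    import os

    def parse_prm_entry(path):
        # Split an entry like /dnn_data/8h-plp39-z/jaat-B_10065_10348.plp
        # into (utterance name, start, end), assuming 'name_start_end' naming.
        base, _ = os.path.splitext(os.path.basename(path))
        name, start, end = base.rsplit("_", 2)
        return name, int(start), int(end)

    print(parse_prm_entry("/dnn_data/8h-plp39-z/jaat-B_10065_10348.plp"))
    # -> ('jaat-B', 10065, 10348)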

The following scripts are provided by me: first, I wrote a script which generates a data.lst file (it simply cuts down the given corpus filenames), concatenates the speech utterances and writes the concatenated files into a directory.

Download these scripts and run the following commands:

python generateUBM.py -i {path to .scp} -o data/prm/ -glst lst/data.lst

and afterwards:

python TrainWorld_TrainTV_Train_IV.py

It is essential that all the config files in the cfg directory, which were downloaded from the tutorial, are fully copied into the working directory.

The main error you will probably run into is an out-of-memory exception. This is most likely caused by the configuration: if the feature type is set to HTK but the "useBigEndian" parameter is still set to false, Alize cannot detect the end of file and will therefore allocate all available memory. Just check that the feature type is set to HTK and that "useBigEndian" is set to true. After the scripts have run, you should have a directory called "iv", into which the i-vectors are extracted.
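
For reference, the two parameters in question look roughly like this in the feature-related config files (the exact parameter names are partly my assumption and may differ between Alize/LIA_RAL versions, so double-check your own cfg files):

    loadFeatureFileFormat    HTK
    useBigEndian             true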

Tuesday, April 15, 2014

Feature extraction and normalization

I started to do some experiments on extracting I-Vectors. Since I will probably implement this extraction myself, it is essential for me to have comparable results.

To understand how this procedure works, I illustrated it with a graph.
Of course, to begin with speaker recognition and everything it involves, it is necessary to get data.
Where do you get data from? Well, the usual corpora provide most of the data I used in my experiments. I mostly consider the NIST corpora and Switchboard a good source of already processed data.

This data mostly comes in the form of speech wave files, either '.sph' or '.wav'.

These files are the raw data, e.g. audio recorded with a microphone or over the telephone (Switchboard and NIST use different recording methods).

These files are analyzed with the HTK tool HCopy. HCopy can be used for many purposes; my use case is the extraction of features from the raw speech.
To achieve this, different extraction techniques can be used, namely MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction). Both use different frame sizes and filter banks.
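
To give a rough idea of this step, here is what a typical HCopy setup for 39-dimensional PLP features (12 cepstra plus C0, with deltas and accelerations) can look like; the file names and the values below are standard textbook settings, not necessarily the ones used for the corpora above:

    # hcopy_plp.cfg
    SOURCEFORMAT = NIST          # .sph input; use WAV for .wav files
    TARGETKIND   = PLP_0_D_A     # PLP + C0 + deltas + accelerations = 39 dims
    TARGETRATE   = 100000.0      # 10 ms frame shift (in 100 ns units)
    WINDOWSIZE   = 250000.0      # 25 ms analysis window
    USEHAMMING   = T
    PREEMCOEF    = 0.97
    NUMCHANS     = 26            # filterbank channels
    NUMCEPS      = 12

    HCopy -C hcopy_plp.cfg -S wav2plp.scp

where each line of wav2plp.scp lists a source file and a target feature file, e.g. "input.sph output.plp".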

So the question is, why do we actually extract features? Why don't we just use the raw input files?

Well, the answer is quite simple: we just don't have enough resources. The estimation process and the number of parameters are gigantic (even after we reduce them via feature extraction). Overall we aim to reduce the number of parameters while at the same time trying to maximize the amount of data we can include in our model. For example, telephone speech sampled at 8 kHz produces 8000 raw samples per second, whereas a 39-dimensional feature vector every 10 ms amounts to only 3900 values per second, and those values are far better suited for statistical modelling.

After finishing the extraction there is a problem: our extracted features also include silence. When the data is recorded in a controlled environment, we can reasonably assume that the resulting data won't contain much background noise.
However, there will always be silence in the recordings. This does not really hurt the calculation of the cepstral log powers, since silence barely affects them, but it does inflate the data size, because every frame results in an N-dimensional feature vector.
So we can reduce the number of vectors by excluding the silence, or to put it differently, by keeping only the speech frames for our features.
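
A very simple way to do this selection is an energy threshold over the frames. The sketch below is only my own illustration of the idea; the actual toolkits use their own, more refined detectors:

    import numpy as np

    def select_speech_frames(features, energy, threshold_ratio=0.3):
        # Naive silence removal: keep frames whose energy lies above a
        # fraction of the range between the quietest and loudest frame.
        # features: (num_frames, dim) array, energy: (num_frames,) array.
        threshold = energy.min() + threshold_ratio * (energy.max() - energy.min())
        return features[energy > threshold]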


After we have removed the silence and have pure-speech feature files (illustrated as the new .plp), we can normalize the means and the variances of our features. This is simply a rescaling of all vector components so that the coefficients do not differ greatly in range from each other (for later use). Experiments have shown that this normalization improves performance.
Once both mean and variance normalization have finished, we have our input feature vectors and can proceed to train our model.
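
For completeness, the normalization itself is just a per-coefficient standardization; here is a minimal sketch of the idea (typically applied per file or per speaker):

    import numpy as np

    def mean_variance_normalize(features, eps=1e-8):
        # Per-coefficient normalization of a (num_frames, dim) feature matrix:
        # subtract the mean and divide by the standard deviation of each dimension.
        mean = features.mean(axis=0)
        std = features.std(axis=0)
        return (features - mean) / (std + eps)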



Tuesday, April 1, 2014

HTK..again

So, after having read quite a few papers about I-Vectors, Eigenvoices, Cluster Adaptive Training and UBMs, I finally started to actually do something.


I am training a UBM with HTK right now. I already did that for a different task last semester, but this time it's simpler (a UBM has just one HMM state, yet a massive number of Gaussians).
Naturally this model could be initialized with the full number of Gaussian mixtures right away, but experiments show that gradually increasing the number of mixtures during training actually improves performance and reduces the training time.



There already exists a script (by Phil Woodland) which does the job of initializing the first prototypes and parameters. Furthermore, it can estimate a whole model by re-estimation while constantly increasing the number of mixture components.
To speed up the process (and hopefully not lose too much accuracy) I need to implement a different step size.
The usual step size is 4, so the number of mixtures follows the sequence 1, 4, 8, 12, ..., MIXTURES, where MIXTURES is the maximum number of mixtures (in my case 512/1024).

I changed the step size to the following sequence: 1, 4, 8, 12, 16, 32, 48, 64, 128, 256, ..., MIXTURES. So I still begin with small mixture counts and double them once the count reaches 16 (48 is an empirical exception).
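
For the record, here is a small Python sketch of how this schedule can be generated (the mix-up itself is then done step by step, typically via HHEd's MU command; the function name is just for illustration):

    def mixup_schedule(max_mixtures):
        # Mixture-increase schedule: linear steps of 4 up to 16,
        # then doubling (with 48 kept as an empirical exception).
        schedule = [1, 4, 8, 12, 16]
        n = 32
        while n <= max_mixtures:
            schedule.append(n)
            if n == 32:
                schedule.append(48)   # empirical exception
            n *= 2
        return schedule

    print(mixup_schedule(512))
    # -> [1, 4, 8, 12, 16, 32, 48, 64, 128, 256, 512]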