Wednesday, May 14, 2014

Kaldi + Eclipse, a very time-consuming task

Kaldi and its "structure"

After finally reading enough of Kaldi's i-vector implementation, I started adding my own code.

Kaldi is not what I would call easily extendable. It has its flaws when it comes to separating objects.

In my case there is already a class called IvectorExtractor. I only need to modify one or two of its methods; since it is tightly coupled to other I/O classes, it would not be wise to rewrite the whole process.

So I implemented my methods by subclassing the IvectorExtractor class and tried to write some tests for it.

Unfortunately it is not easy to simulate the test at all: in the usual i-vector extraction procedure the prerequisites are substantial. We need to train a UBM, then estimate the T matrix, and only then can we extract the i-vector.

My thoughts were:

  1. Train a regular model and test my own class after the T matrix has been estimated (naturally the easiest solution)
  2. Write a test which generates random data and estimates on that (see the sketch below)
Either way, in the end I needed to debug the code, and boy, did it take time to do so.
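To make the second option concrete, here is a minimal numpy sketch of the idea: draw random sufficient statistics and a random T matrix, then compute the i-vector as the textbook posterior mean w = (I + sum_c N_c T_c' Sigma_c^-1 T_c)^-1 sum_c T_c' Sigma_c^-1 f_c. This is just the standard formula applied to fake data, not Kaldi's actual IvectorExtractor code.

import numpy as np

# dimensions: C UBM components, D-dimensional features, R-dimensional i-vector
C, D, R = 8, 13, 5
rng = np.random.default_rng(0)

N = rng.uniform(1.0, 10.0, size=C)        # random zeroth-order stats per component
F = rng.standard_normal((C, D))           # random centered first-order stats
Sigma_inv = np.ones((C, D))               # inverse diagonal covariances (identity here)
T = 0.1 * rng.standard_normal((C, D, R))  # random total-variability matrix

L = np.eye(R)
b = np.zeros(R)
for c in range(C):
    TS = T[c].T * Sigma_inv[c]            # T_c' Sigma_c^-1, shape (R, D)
    L += N[c] * (TS @ T[c])               # accumulate the posterior precision
    b += TS @ F[c]                        # accumulate projected first-order stats

ivector = np.linalg.solve(L, b)           # posterior mean = point estimate of w
print(ivector)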

So, to recapitulate:

  • Kaldi works with Grid Engine, a framework for parallel batch processing
  • Batch processing is done by building binary executables and running them in parallel
  • This again produces a lot of code, which consists of an executable main() and, usually, a separate class that does the actual work
  • Each binary is usually just one link in an execution chain, since we can't simply do all the work in one main()
The result is a large number of binaries which need to be executed in a particular order.
Kaldi usually handles that with the bash scripts that ship with the current version.
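To make the chaining idea concrete, here is a minimal sketch of how such a chain of binaries can be driven in order from a script; the binary names and arguments below are placeholders for illustration, not real Kaldi programs.

import subprocess

# run each stage of the (hypothetical) pipeline in order; every binary must
# finish before the next one starts, exactly like the shipped bash scripts do
steps = [
    ["bin/accumulate-stats", "--config=conf/stats.conf", "data/feats.ark", "exp/stats.1"],
    ["bin/estimate-model", "exp/stats.1", "exp/model.1"],
    ["bin/extract-output", "exp/model.1", "data/feats.ark", "exp/output.ark"],
]

for cmd in steps:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)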

Debugging

So to debug the code I had written, I tried to use Eclipse as an IDE. Eclipse usually comes with gdb integration and can debug C++ code.
However, in the case of Kaldi we do not just start a single C++ executable that we can pass parameters to; several other steps have to be run beforehand.
To handle this, there are some options:
  • Debug through the shell script by using gdbserver (command: gdbserver localhost:1234 <file>)
  • "Rewrite" the scripts in Eclipse as a launch configuration that includes all the commands used inside the scripts (takes a lot of time)
  • Cherry-pick just the part of the code where the class to be debugged is called, and hope everything works fine (the fast way)
Finally I picked the last option.



Wednesday, April 16, 2014

Tutorial: Training a UBM Model and Extracting I-Vectors with Alize 3.0

To train a model and get familiar with i-vector behavior, I trained my model with the readily available Alize 3.0 framework.

Alize provides, after compilation, every binary needed for training and extraction.
HTK can also be downloaded if the precompiled Alize binaries are not sufficient or simply won't work on your architecture.

The config files from this tutorial are also needed, since I based my setup heavily on it, although I automated the whole process. I will point out the 'pain in the ass' parts of the configuration.

After downloading and compiling, create a new directory in which you would like to estimate the model (e.g. model).

Execute the following commands in the console:

mkdir lib
mkdir lib/scp
mkdir lst
mkdir data
mkdir data/prm
mkdir data/lbl


After that we need to link the Alize binaries into this directory, so just create a softlink and execute:
ln -s <your path to Alize>/bin .
cp -r {YourIVectorDir} cfg

To begin the process, it is necessary to generate (if not already done) a .scp file which maps the given raw audio files (either .wav or .sph) to the corresponding feature files (PLP, MFCC, or whatever you like).

The format of this file is, e.g.:

/dnn_data/8h-plp39-z/jaat-B_10065_10348.plp

where the last part of the filename indicates which stretch of that speaker's full speech the utterance covers.

The following scripts are provided by me. First, I wrote a script which generates a data.lst file (it simply cuts down the given corpus filenames), concatenates the speech utterances, and outputs the concatenated files into a directory.
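For illustration, this is roughly what the data.lst part does (a simplified sketch, not the full generateUBM.py; the concatenation step is left out and the exact filename-cutting rule depends on the corpus naming):

import os
import sys

def write_data_lst(scp_path, lst_path):
    # cut each corpus path down to its base name and write one entry per line
    with open(scp_path) as scp, open(lst_path, "w") as lst:
        for line in scp:
            path = line.strip()
            if not path:
                continue
            # e.g. /dnn_data/8h-plp39-z/jaat-B_10065_10348.plp -> jaat-B_10065_10348
            name = os.path.splitext(os.path.basename(path))[0]
            lst.write(name + "\n")

if __name__ == "__main__":
    write_data_lst(sys.argv[1], "lst/data.lst")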

Download these scripts and run the following commands:



python generateUBM.py -i {path to .scp} -o data/prm/ -glst lst/data.lst

and afterwards:

python TrainWorld_TrainTV_Train_IV.py
It is essential that all the config files in the cfg directory, which were downloaded from the tutorial, are fully copied to the working directory.

The main error you will probably run into is an out-of-memory exception. This is most likely a configuration issue: if the feature type is set to HTK but the "useBigEndian" parameter is still set to false, Alize can't detect the EOF and will therefore allocate all available memory. Just check that the feature type is set to HTK and that "useBigEndian" is set to true.

After the scripts have run, you should have a directory called "iv" where the i-vectors are extracted to.

Tuesday, April 15, 2014

Feature extraction and normalization

I started to do some experiments on extracting i-vectors. Since I will probably implement this extraction myself, it is essential for me to have comparable results.

To understand how this procedure works, I illustrated it with a graph.
Of course, to get started with speaker recognition and everything it involves, it is necessary to get data.
Where do you get data from? Well, the usual corpora provide most of the data I used in my experiments. I mostly consider the NIST corpora and Switchboard good sources of already processed data.

This data comes mostly in the form of speech wave files, either '.sph' or '.wav'.

These files are the raw data, e.g. the audio recorded with a microphone or telephone, etc. (Switchboard and NIST use different methods).

These files are analyzed with HCopy, a program from the HTK toolkit. HCopy can be used for many applications; mine is the extraction of features from the raw speech.
To achieve this, different extraction techniques can be used, namely MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction). They use different frame sizes and filter banks.
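As an illustration of the HCopy step, here is a small sketch that writes a typical HTK config and calls HCopy on a source/target script file; the parameter values (25 ms windows every 10 ms, 12 cepstra plus C0, deltas and accelerations) are common defaults meant as an example, not necessarily my exact configuration:

import subprocess

# typical HTK front-end settings (illustrative values, not my exact config)
HTK_CONFIG = """\
SOURCEFORMAT = NIST
TARGETKIND   = MFCC_0_D_A
WINDOWSIZE   = 250000.0
TARGETRATE   = 100000.0
USEHAMMING   = T
PREEMCOEF    = 0.97
NUMCHANS     = 26
NUMCEPS      = 12
"""

with open("hcopy.conf", "w") as f:
    f.write(HTK_CONFIG)

# hcopy.scp lists one "source target" pair per line, e.g.
#   raw/jaat-B.sph  data/prm/jaat-B.plp
subprocess.run(["HCopy", "-T", "1", "-C", "hcopy.conf", "-S", "hcopy.scp"], check=True)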

So the question is, why do we actually extract features? Why don't we just use the raw input files?

Well, the answer is quite simple: we just don't have enough resources. The estimation process and the number of parameters are gigantic (even after we reduce the parameters via feature extraction). Overall we aim to reduce the number of parameters while at the same time maximizing the amount of data we can include in our model.

After finishing the extraction there is a problem: our extracted features also include silence. When the data is recorded in a controlled environment, we can reasonably assume that it won't contain much background noise,
but there will always be silence in the recordings. That essentially doesn't hurt when calculating the cepstral log powers, since it won't affect them, yet the data size increases, because every frame results in an N-dimensional feature vector.
So we can reduce the number of vectors by excluding the silence, or to put it differently, by keeping only the speech frames for our features.
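A crude, illustrative version of that silence-removal step (not the actual tool I use): keep only the frames whose log energy lies within some margin of the utterance maximum.

import numpy as np

def drop_silence(frames, margin=3.0):
    # frames: (T, D) feature matrix with the log energy in the last column
    log_e = frames[:, -1]
    keep = log_e > (log_e.max() - margin)  # crude energy-based speech detection
    return frames[keep]

# toy usage on random "features"
feats = np.random.randn(1000, 40)
print(feats.shape, "->", drop_silence(feats).shape)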


After we have removed the silence and have pure-speech feature files (illustrated as the new .plp), we can normalize the means and variances of our features. This is simply a rescaling of all vector components so that the coefficients do not differ greatly from each other (for later use). Experiments have shown that this normalization improves performance.
After both mean and variance normalization are finished, we have our input feature vectors and can proceed to train our model.
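The normalization step itself is small; here is a minimal per-utterance mean and variance normalization sketch in numpy (not the exact tool I use):

import numpy as np

def mean_variance_normalize(frames, eps=1e-8):
    # frames: (T, D) feature matrix; returns zero-mean, unit-variance coefficients
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return (frames - mean) / (std + eps)

# toy usage on random "features"
feats = np.random.randn(500, 39) * 5.0 + 2.0
norm = mean_variance_normalize(feats)
print(norm.mean(axis=0)[:3].round(3), norm.std(axis=0)[:3].round(3))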



Tuesday, April 1, 2014

HTK..again

So, after having read quite a few papers about i-vectors, Eigenvoices, Cluster Adaptive Training, and UBMs, I finally started to actually do something.


I am training a UBM with HTK right now. I already did that for a different task last semester, but this time it's simpler (a UBM has just one HMM state, yet a massive number of Gaussians).
Naturally this model could be initialized with the full number of Gaussian mixture components right away, but experiments show that gradually increasing the number of mixtures during training actually improves performance and reduces the training time.



There already exists a script (by Phil Woodland) which does the job of initializing the first prototypes and parameters. Furthermore, it can estimate a whole model by re-estimation while constantly increasing the number of mixture components.
To speed up the process (and hopefully not lose too much accuracy) I need to implement a different step size.
The usual step size is 4, so the number of mixtures follows the sequence 1, 4, 8, 12, ..., MIXTURES, where MIXTURES is the maximum number of mixtures (in my case 512/1024).

I changed the step size to give the following sequence: 1, 4, 8, 12, 16, 32, 48, 64, 128, 256, ..., MIXTURES. So I still begin with small mixture counts and double them once the count reaches 16 (48 is an empirical exception).
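The schedule is easy to generate; here is a small Python sketch of the sequence I use (with 48 hard-coded as the empirical exception):

def mixture_schedule(max_mixtures):
    # step of 4 up to 16, then doubling, with 48 inserted as an empirical exception
    counts = [1, 4, 8, 12, 16]
    n = 16
    while n * 2 <= max_mixtures:
        n *= 2
        counts.append(n)
        if n == 32:
            counts.append(48)
    return [c for c in counts if c <= max_mixtures]

print(mixture_schedule(512))  # [1, 4, 8, 12, 16, 32, 48, 64, 128, 256, 512]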


Thursday, March 27, 2014

CAT approach

CAT

Introduction

After reading the basic paper about Cluster Adaptive Training (CAT), I got a good idea of the i-vector approach and the statistics behind it.

CAT provides a method to substantially reduce the number of parameters for training while increasing the performance and accuracy of the system.

CAT relies on GMM-HMM adaptation and clusters similar speakers together, but the component priors and variances are tied across all speaker clusters, hence only the means vary between clusters.

Definition

The challenge is to calculate the speaker-dependent mean of each Gaussian component.


This results in the following model. To estimate the two unknowns (the cluster means and the speaker-specific weights), EM is used; the estimation formula is defined below.
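Since the original equation images are not reproduced here, this is the standard CAT formulation as I understand it (my own reconstruction, following Gales' cluster adaptive training). The speaker-dependent mean of Gaussian component m is a weighted combination of the cluster means:

\mu_m^{(s)} = M_m \, \lambda^{(s)}

where M_m = [\mu_m^{(1)} \cdots \mu_m^{(K)}] stacks the K cluster means of component m and \lambda^{(s)} is the cluster-weight vector of speaker s; the covariances \Sigma_m and the component priors are shared by all clusters. With the component posteriors \gamma_m^{(s)}(t), the EM update of the weights is

\hat{\lambda}^{(s)} = \left( \sum_{m,t} \gamma_m^{(s)}(t) \, M_m^{\top} \Sigma_m^{-1} M_m \right)^{-1} \sum_{m,t} \gamma_m^{(s)}(t) \, M_m^{\top} \Sigma_m^{-1} o^{(s)}(t)

and the cluster means M_m are re-estimated analogously with the weights held fixed.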

Tuesday, March 18, 2014

Making some progress :(

After some time, I have now gotten into the basic structure of the topic and how to approach it.

I talked with some people who share an interest in speech science but work on robust speech recognition systems.
One of them, who has already worked in the field for four years, said that my approach, detecting a speaker from less than 3 seconds of speech, is basically impossible.

I was not shocked, because to me it also seems quite impossible to detect a speaker in such a short time in a text-independent setting.

But I looked for some papers and found one which wasn't recommended by my colleagues: "A study on Universal Background Model training in Speaker Verification". This paper digs deep into GMM-UBM models, which are the baseline of my implementation.
The authors show that UBM models can achieve state-of-the-art performance without being fed enormous amounts of data. They achieve an error rate of about 11% with an input length of 2.7 seconds per speaker, which is already good.

But it doesn't directly apply to my case, since they didn't use the "total variability space" or i-vector extraction, which will probably speed up the process and boost the computational performance.

So basically I get my input speech from the NIST SRE corpora (2005 - 2008). I process the signal into (probably) 39-dimensional MFCC vectors including log energy. Finally I need to estimate my model from the generated vectors.
Cambridge's HTK is used for the UBM, which models the feature vectors with Gaussians in a GMM, which itself is just a one-state HMM. Multiple HMMs are not necessary since the i-vector approach tries to map all the information of the input into one space, not into multiple independent spaces.
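As a toy illustration of what the UBM amounts to mathematically, it is a single GMM fitted on all pooled feature frames; here sketched with scikit-learn on random stand-in data instead of HTK on real features:

import numpy as np
from sklearn.mixture import GaussianMixture

frames = np.random.randn(5000, 39)           # stand-in for the pooled 39-dim feature vectors
ubm = GaussianMixture(n_components=64,       # 512 or 1024 in the real setup
                      covariance_type="diag",
                      max_iter=20)
ubm.fit(frames)
print(ubm.weights_.shape, ubm.means_.shape)  # (64,) and (64, 39)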

When I have finished estimating my UBM, I can use the resulting model to estimate my i-vector parameters (if I am correct so far).




Wednesday, March 12, 2014

Beginning of Thesis

So it's on, shit is on fire: I have started my Bachelor's thesis. I started this blog in case I need to share some results or just write up any problems that occur along the way.

The topic is "short-time speaker verification with deep learning", which is in the field of human-machine interaction, or to be more precise, speech recognition.

Mainly it revolves around introducing several algorithms into Cambridge's HTK framework.