Thursday, March 27, 2014

CAT approach


Introduction

After reading the basic paper about Cluster Adaptive Training (CAT), I got a good idea of the i-vector approach and the statistics behind it.

CAT provides a method to substantially reduce the number of parameters to train, while at the same time increasing the performance and accuracy of the system.

CAT relies on GMM-HMM adaptation and clusters similar speakers together, but ties the component priors and variances across all speaker clusters, hence only the means vary between clusters.

Definition

The challenge is to calculate the speaker-dependent mean for a given Gaussian component.


This results in the following model:
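As far as I can reconstruct it from the paper (my own notation, so treat it as a sketch), the CAT mean for speaker s and Gaussian component m is an interpolation of the P cluster means:

$$\mu_m^{(s)} = M_m \,\lambda^{(s)}, \qquad M_m = \left[\, \mu_m^{(1)} \;\cdots\; \mu_m^{(P)} \,\right]$$

where $M_m$ collects the cluster means of component $m$ and $\lambda^{(s)}$ is the speaker-specific cluster weight vector; the covariances $\Sigma_m$ and priors $c_m$ stay tied across all clusters.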
To estimate these two unknown variables, EM is used. The estimation formula is defined as:
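For the speaker weights (with the cluster means held fixed), my reading of the re-estimation step is, again only as a sketch in my own notation:

$$\hat{\lambda}^{(s)} = \left( \sum_{m} \sum_{t} \gamma_m(t)\, M_m^{\top} \Sigma_m^{-1} M_m \right)^{-1} \sum_{m} \sum_{t} \gamma_m(t)\, M_m^{\top} \Sigma_m^{-1}\, o_t$$

where $\gamma_m(t)$ is the posterior occupancy of component $m$ at frame $t$ of that speaker's data; the cluster means are then updated in an analogous maximization step.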

Tuesday, March 18, 2014

Making some progress :(

After some time, I have now gotten into the basic structure of the topic and how to approach it.

I talked with some guys who share an interest in speech science but work on robust speech recognition systems.
One of them, who has already worked four years in the field, said that my approach of detecting a speaker within less than 3 seconds of speech is basically impossible.

I was not shocked, because it also seems quite impossible to me to detect a speaker within such a short time in a text-independent environment.

But I looked for some papers and found one that wasn't recommended by the colleagues: A study on Universal Background Model training in Speaker Verification. This paper digs deep into GMM-UBM models, which are the baseline of my implementation.
It shows that UBM models can achieve state-of-the-art performance without being fed enormous amounts of data. The authors achieve an error rate of about 11% with an input length of 2.7 seconds per speaker, which is already good.

But this doesn't apply to my case, since they didn't use the "total variability space" or "i-vector extraction", which will probably speed up the process and boost the computational performance.

So basically I get my input signal, or speech, from the NIST SRE corpus (2005 - 2008). I process the signal with MFCC into a (probably) 39-dimensional feature vector including log energy. Finally I need to generate my model out of these vectors.
Cambridge's HTK is used for the UBM, which models the feature vectors with a GMM, which itself is just a one-state HMM. Multiple HMMs are not necessary since the "i-vectors" try to map all the information of the input into one space, not into multiple independent spaces.
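Just to make the feature step concrete for myself, here is a small Python sketch of the 39-dimensional front end (13 coefficients plus deltas and delta-deltas). It uses librosa instead of HTK's HCopy, purely for illustration, and the file name is of course only a placeholder:

import numpy as np
import librosa

def extract_features(wav_path, sr=8000):
    # NIST SRE data is telephone speech, so 8 kHz sampling is assumed here
    y, sr = librosa.load(wav_path, sr=sr)

    # 13 cepstral coefficients; the 0th coefficient acts as the energy-like term
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, frames)
    d1 = librosa.feature.delta(mfcc)                      # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)             # second derivatives

    # one 39-dimensional vector per frame
    return np.vstack([mfcc, d1, d2]).T

feats = extract_features("some_sre_utterance.wav")
print(feats.shape)  # (frames, 39)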

When I am finished estimating my UBM model, I can begin to use the output model to estimate my i-vector parameters (if I am correct so far).
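To check my understanding of that next step, here is a rough numpy sketch of how an i-vector could be extracted from the Baum-Welch statistics of one utterance, assuming the UBM (weights w, means mu, diagonal covariances cov) and a total variability matrix T_mat are already available as arrays; training T_mat itself (also done with EM) is left out:

import numpy as np
from scipy.stats import multivariate_normal

def baum_welch_stats(X, w, mu, cov):
    """Zeroth- and centered first-order statistics of X (frames x D) w.r.t. the UBM."""
    C, D = mu.shape
    log_lik = np.stack([
        multivariate_normal.logpdf(X, mean=mu[c], cov=np.diag(cov[c])) + np.log(w[c])
        for c in range(C)
    ], axis=1)                                  # (frames, C)
    gamma = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)   # per-frame component posteriors
    N = gamma.sum(axis=0)                       # (C,) zeroth-order stats
    F = gamma.T @ X - N[:, None] * mu           # (C, D) first-order stats, centered on UBM means
    return N, F

def extract_ivector(N, F, T_mat, cov):
    """Point estimate of the i-vector: (I + T' S^-1 N T)^-1 T' S^-1 F."""
    C, D = F.shape
    R = T_mat.shape[1]                          # i-vector dimension
    inv_cov = 1.0 / cov                         # inverse of the diagonal covariances
    L = np.eye(R)
    b = np.zeros(R)
    for c in range(C):
        Tc = T_mat[c * D:(c + 1) * D]           # (D, R) block for component c
        L += N[c] * Tc.T @ (inv_cov[c][:, None] * Tc)
        b += Tc.T @ (inv_cov[c] * F[c])
    return np.linalg.solve(L, b)

# usage, given the feature matrix from above and a (hypothetical) trained UBM:
# N, F = baum_welch_stats(feats, w, mu, cov)
# ivec = extract_ivector(N, F, T_mat, cov)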




Wednesday, March 12, 2014

Beginning of Thesis

So it's on, shit is on fire, I have begun my Bachelor's thesis. I started this blog in case I need to share some results or just publish any kind of problems that occur during its course.

The topic is "short-time speaker verification with deep learning", which is in the field of human-machine interaction or, to be more precise, speech recognition.

Mainly it revolves around introducing several algorithms into Cambridge's HTK framework.