I talked with some guys who share some points of interest in speech science, but work on robust speech recognition systems.
One of them, who has already worked four years in the field, said that my goal of detecting a speaker within less than 3 seconds of speech is basically impossible.
I was not shocked, because it also seems quite impossible to me to detect a speaker in such a short time in a text-independent environment.
But I looked for some papers and found one that wasn't recommended by the colleagues: A study on Universal Background Model training in Speaker Verification. This paper digs deep into GMM-UBM models, which are the baseline of my implementation.
They show that UBM models can achieve state-of-the-art performance without being fed enormous amounts of data. They achieve an error rate of about 11% with an input length of 2.7 seconds per speaker, which is already good.
But it doesn't fully apply to my case, since they didn't use the "total variability space" or "i-vector extraction", which will probably speed up the process and boost the computational performance.
So basically I get my input speech signal from the NIST SRE corpus (2005 - 2008). I process the signal with MFCC into feature vectors of (probably) length 39, including log energy. Finally I need to generate my model out of the extracted vectors.
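Just to make that front end concrete, here is a minimal sketch of the feature extraction step in Python, assuming librosa is available and that the configuration is 13 MFCCs (with c0 roughly standing in for log energy) plus deltas and delta-deltas; the sampling rate, window and shift are placeholder choices of mine, not the exact HTK setup:

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=8000):
    """Turn one utterance into a (frames, 39) MFCC feature matrix:
    13 static coefficients plus deltas and delta-deltas."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr))  # 10 ms shift
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T  # shape (frames, 39)
```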
Cambridge's HTK is used for the UBM, which models the feature vectors as a mixture of Gaussians (a GMM), which itself is just a one-state HMM. Multiple HMMs are not necessary, since the "i-vectors" try to map all the information of the input into one space, not into multiple independent spaces.
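My UBM is actually trained with HTK, but to illustrate what that step amounts to, here is a rough sketch using scikit-learn's GaussianMixture instead: features from many background utterances are pooled and a single large diagonal-covariance GMM is fitted on them. The number of components (512) is only an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=512):
    """Fit one diagonal-covariance GMM on pooled background data.

    background_features: list of (frames, 39) arrays, one per
    background utterance. The fitted GMM plays the role of the UBM."""
    pooled = np.vstack(background_features)
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',
                          max_iter=100)
    ubm.fit(pooled)
    return ubm
```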
When I am finished estimating my UBM, I can begin to use the output model to estimate my i-vector parameters (if I am correct so far).
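For reference, assuming the total variability matrix T has already been trained (its EM training is not shown here), the i-vector of one utterance is the standard MAP point estimate computed from the zeroth- and first-order Baum-Welch statistics of that utterance against the UBM. A sketch with my own variable names, reusing the UBM object from above:

```python
import numpy as np

def extract_ivector(feats, ubm, T):
    """Point estimate of the i-vector for one utterance.

    feats : (frames, 39) feature matrix
    ubm   : trained diagonal-covariance GaussianMixture (the UBM)
    T     : (n_components * 39, ivec_dim) total variability matrix"""
    n_comp, dim = ubm.means_.shape
    ivec_dim = T.shape[1]

    # Zeroth- and centred first-order Baum-Welch statistics w.r.t. the UBM
    post = ubm.predict_proba(feats)               # (frames, n_comp)
    N = post.sum(axis=0)                          # (n_comp,)
    F = post.T @ feats - N[:, None] * ubm.means_  # (n_comp, dim)

    # Flatten to supervector form together with the diagonal covariances
    sigma_inv = (1.0 / ubm.covariances_).reshape(-1)  # (n_comp * dim,)
    N_rep = np.repeat(N, dim)                         # (n_comp * dim,)
    F_flat = F.reshape(-1)

    # w = (I + T' Sigma^-1 N T)^-1  T' Sigma^-1 F
    precision = np.eye(ivec_dim) + T.T @ ((N_rep * sigma_inv)[:, None] * T)
    return np.linalg.solve(precision, T.T @ (sigma_inv * F_flat))
```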