
Importance of utterance lengths

ericbolo opened this issue Nov 25 '17 • 5 comments

The utterances in the TEDLIUM dataset roughly range from 8 to 15 seconds.

I have a dataset with shorter utterances, ~5 to 10 seconds long.

What are the optimal and minimum utterance lengths for RNN-CTC training?

ericbolo · Nov 25 '17

Related question: I know CMVN (cepstral mean and variance normalization) can suffer from short utterances. In my current dataset I have only one utterance per speaker.

Has anyone trained on a similar dataset (short utterances, one utterance per speaker)?

Thanks, all!

ericbolo · Nov 25 '17

Yes, CMVN can be sensitive to short utterances. You may want to smooth the statistics across utterances, or use a sliding window, if your data supports that.

We ran some experiments in the lorelei branch (new files in the featbin directory) that use power (signal energy) to decide which frames to compute the CMVN statistics on, but the results were ultimately inconclusive. The process is to get an alignment (using Kaldi in this case), compute the CMVN statistics on the non-silence frames only, and then apply the normalization to all frames. Alternatively, you can fake the alignment using power (signal energy) alone, or some other criterion, and determine the non-silence frames from that. The purpose is to make the CMVN computation independent of the actual segmentation, which may be arbitrary.
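For illustration, here is a minimal NumPy sketch of that idea (not the actual lorelei-branch code): treat the lowest-energy frames as silence, compute the mean and variance on the remaining frames only, and normalize all frames with those statistics. The energy column index and the quantile threshold are illustrative assumptions.

```python
import numpy as np

def energy_selective_cmvn(feats, energy_dim=0, silence_quantile=0.3):
    """CMVN computed on non-silence frames only, applied to all frames.

    feats: (num_frames, feat_dim) array, e.g. MFCCs or filterbanks,
           where column `energy_dim` carries log energy (or C0).
    Frames whose energy falls below the given quantile are treated
    as silence and excluded from the statistics.
    """
    energy = feats[:, energy_dim]
    threshold = np.quantile(energy, silence_quantile)
    voiced = energy > threshold              # crude energy-based "alignment"
    if voiced.sum() < 2:                     # degenerate clip: use all frames
        voiced = np.ones(len(feats), dtype=bool)
    mean = feats[voiced].mean(axis=0)
    std = feats[voiced].std(axis=0) + 1e-10  # avoid division by zero
    return (feats - mean) / std              # normalize every frame
```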

Let me know if this works for you, we’d be interested in an update as well.


fmetze · Nov 26 '17

Thank you, @fmetze !

This Kaldi module applies a sliding window for CMVN computation: http://kaldi-asr.org/doc/apply-cmvn-sliding_8cc.html

However, I don't understand the advantage of sliding windows. Is it simply a kind of data augmentation?

As for running CMVN on voiced frames only, I could try using a few voice activity detection algorithms I have at hand.

I will first run the experiment with plain CMVN, then try these optimizations if needed. In any case, I'll keep you apprised. Thanks again for your prompt and detailed answer!

ericbolo · Nov 26 '17

The sliding window should typically be a few seconds long, no? It then computes the statistics over some local context, assuming that the speaker characteristics don't change quickly. For talks or telephony speech, this is certainly true; for meetings, it may be less so. So it is not data augmentation, just a way to get robust, local normalization statistics. Keep me posted; I've always wanted to look into this, too.
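To make this concrete, here is a minimal NumPy sketch of sliding-window CMVN; if I remember correctly, Kaldi's apply-cmvn-sliding defaults to a 600-frame window (roughly 6 seconds at a 10 ms frame shift) with mean-only normalization. The window length and variance normalization here are illustrative choices, not Kaldi's defaults.

```python
import numpy as np

def sliding_window_cmvn(feats, window=600, norm_vars=True):
    """Normalize each frame with statistics from a centered local window.

    feats: (num_frames, feat_dim) feature matrix.
    window: window length in frames (600 frames ~ 6 s at a 10 ms shift).
    """
    num_frames = len(feats)
    out = np.empty_like(feats, dtype=float)
    half = window // 2
    for t in range(num_frames):
        lo = max(0, t - half)              # window is truncated at the edges
        hi = min(num_frames, t + half + 1)
        chunk = feats[lo:hi]
        out[t] = feats[t] - chunk.mean(axis=0)
        if norm_vars:
            out[t] /= chunk.std(axis=0) + 1e-10
    return out
```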


fmetze · Nov 28 '17

A quick update: with regular CMVN and no sliding window, the phonetic model reaches 79% token accuracy. So the model learns fairly well despite the short utterances and having only one utterance per speaker.

(Edit: to be more precise, it reaches 90% token accuracy on the training set and 79% on the cross-validation set.)

ericbolo · Dec 06 '17