[Streaming] Conv emformer right context length
Hi, I'm noticing some odd behaviour when altering the right context length of the conv emformer. Increasing the right context length while keeping the chunk size constant results in higher loss and worse WER. I can't understand why increasing the future context would degrade performance. I also notice that in your experiments the right context length is always 1/4 of the chunk size. Is there a reason for this, i.e. does the right context length need to be a fixed proportion of the chunk size? Many thanks in advance.
For the ConvEmformer model, the chunk-length and right-context-length are both fixed during training. It is possible to get worse results if a different chunk-length or right-context-length is used during decoding.
Sorry, I should have clarified. I am training the model with a larger right context length (and then decoding with the same values)
Could you share the numbers you set for both experiments?
Chunk size 32 + right context length 8 (default) vs. chunk size 32 + right context length 32. The second experiment gives worse results so far.
We have not tried such a configuration with equal chunk-length and right-context-length. Maybe you could try chunk-length=32 and right-context-length=12 or 16, to see whether that gives an improvement? You could also see the Emformer paper https://arxiv.org/pdf/2010.10759.pdf for details.
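Not an answer to the WER question, but for context: one common reason for keeping the right context much smaller than the chunk is that it adds directly to the algorithmic lookahead latency. A rough back-of-the-envelope comparison, assuming a 10 ms frame shift and that both lengths are counted in acoustic frames (check the `train.py` help text for your version, since these assumptions may not match your setup):

```python
# Rough lookahead-latency comparison for the configurations discussed above.
# Assumes a 10 ms frame shift; adjust if chunk/right-context lengths are
# counted in subsampled frames in your setup.
FRAME_SHIFT_MS = 10


def lookahead_ms(chunk_length: int, right_context_length: int) -> float:
    # Worst-case lookahead: a frame at the start of a chunk must wait for the
    # rest of the chunk plus the right-context frames before it is emitted.
    return (chunk_length + right_context_length) * FRAME_SHIFT_MS


for rcl in (8, 16, 32):
    print(f"chunk=32, right_context={rcl}: {lookahead_ms(32, rcl)} ms lookahead")
```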
I tried right context length (RCL) = 16: it was better than RCL=32, but still worse than RCL=8. From my rough experiments it seems that increasing the RCL degrades performance.
Just wondering, is the training input padded by right_context_length frames?
Ok. Our experiments on a streaming conformer trained with dynamic chunk size also show that increasing the right context does not consistently give improvements during decoding.
@bethant9 Hi, can you upload the WER files for the different right-context models? Or can you tell us the error types (substitutions, deletions, insertions) for the different models?
@bethant9 Can you try increasing the tail padding length in decode.py? https://github.com/k2-fsa/icefall/blob/6709bf1e6325166fcb989b1dbb03344d6b90b7f8/egs/librispeech/ASR/conv_emformer_transducer_stateless2/decode.py#L280 Maybe we need a larger tail padding length for a model with a larger right context length, to avoid losing the tail features.
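For anyone reading along, a minimal sketch of what enlarging that tail padding might look like. The variable names, the padding value, and the relationship between right_context_length and the subsampling factor are assumptions here, not the exact icefall code; check the linked line in decode.py for the real implementation:

```python
import math

import torch
import torch.nn.functional as F

# Hypothetical sketch: enlarge the tail padding so that the last chunk plus
# its right context is fully covered and no real frames are dropped.
# `feature` is (N, T, C) log-mel features; all names/values are assumptions.
N, T, C = 1, 500, 80
feature = torch.randn(N, T, C)
feature_lens = torch.tensor([T])

subsampling_factor = 4     # Conv2dSubsampling reduces the frame rate by 4
right_context_length = 32  # value from the failing experiment (assumed unit)
tail_pad = 30 + right_context_length * subsampling_factor  # larger than the default

# Pad the time axis on the right with a very small log-energy value.
feature = F.pad(feature, (0, 0, 0, tail_pad), value=math.log(1e-10))
feature_lens = feature_lens + tail_pad
```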
Hi, I found that I needed to pad the training data with right-context-length extra frames; otherwise, during training, right-context-length frames are removed from the input, which leads to incorrect training and higher WER.
If you don't pad during training, does the model with the longer right context show a high deletion error, especially at the end of the sentence?
Yes, exactly: the ends of utterances aren't correctly trained, so the deletion error is high.
I solved this by padding the input with right-context-length frames just before the Emformer.
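For anyone hitting the same issue, here is a minimal sketch of that kind of padding, applied to the encoder input right before the Emformer layers. The function name is hypothetical, and whether right_context_length is counted in raw or subsampled frames depends on where in the model you apply it; adapt accordingly:

```python
from typing import Tuple

import torch
import torch.nn.functional as F


def pad_right_context(
    x: torch.Tensor,
    x_lens: torch.Tensor,
    right_context_length: int,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Append right_context_length zero frames to the time axis so that the
    frames the Emformer strips off as right context are padding, not real speech.

    x:      (N, T, C) features just before the Emformer layers
    x_lens: (N,) valid lengths of x
    """
    # Pad the right side of the time dimension only.
    x = F.pad(x, (0, 0, 0, right_context_length))
    x_lens = x_lens + right_context_length
    return x, x_lens


# Usage sketch with dummy tensors.
x = torch.randn(2, 100, 512)
x_lens = torch.tensor([100, 80])
x, x_lens = pad_right_context(x, x_lens, right_context_length=32)
```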
Congratulations! Today I also found that the Emformer drops the last chunk, which may lead to deletion errors at the end. Thanks for your reply!
> Hi, I found that I needed to pad the training data with right-context-length extra frames; otherwise, during training, right-context-length frames are removed from the input, which leads to incorrect training and higher WER.
Great!
@bethant9 Hi, do you plan to open a PR to fix it?
No, but I'm happy for you to do that if you want.
OK, I will give it a try, thanks!