[Streaming] Conv emformer right context length
Hi, I'm noticing some odd behaviour when altering the right context length of the conv emformer. Increasing the right context length while keeping the chunk size constant results in higher loss and worse WER. I can't understand why increasing the future context would degrade performance. I also notice that in your experiments the right context length is always 1/4 of the chunk size. Is there a reason for this, i.e. does the right context length need to be a fixed proportion of the chunk size? Many thanks in advance.
For the ConvEmformer model, the chunk-length and right-context-length are both fixed during training. It is possible to get worse results if a different chunk-length or right-context-length is used during decoding.
Sorry, I should have clarified. I am training the model with a larger right context length (and then decoding with the same values)
Could you share the numbers you set for both experiments?
Chunk size 32 + right context length 8 (default) vs. chunk size 32 + right context length 32. The second experiment gives worse results so far.
We have not tried such a configuration with equal chunk-length and right-context-length. Maybe you could try chunk-length=32 and right-context-length=12 or 16, to see whether that gives an improvement? You could also see the Emformer paper https://arxiv.org/pdf/2010.10759.pdf for details.
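Not an answer to the WER question, but for context: one common reason for keeping the right context much smaller than the chunk is that it adds directly to the algorithmic lookahead latency. A rough back-of-the-envelope comparison, assuming a 10 ms frame shift and that both lengths are counted in acoustic frames (check the `train.py` help text for your version, since these assumptions may not match your setup):

```python
# Rough lookahead-latency comparison for the configurations discussed above.
# Assumes a 10 ms frame shift; adjust if chunk/right-context lengths are
# counted in subsampled frames in your setup.
FRAME_SHIFT_MS = 10


def lookahead_ms(chunk_length: int, right_context_length: int) -> float:
    # Worst-case lookahead: a frame at the start of a chunk must wait for the
    # rest of the chunk plus the right-context frames before it is emitted.
    return (chunk_length + right_context_length) * FRAME_SHIFT_MS


for rcl in (8, 16, 32):
    print(f"chunk=32, right_context={rcl}: {lookahead_ms(32, rcl)} ms lookahead")
```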
I tried right context length (RCL) = 16: it was better than RCL=32, but still worse than RCL=8. From my rough experiments it seems that increasing the RCL degrades performance.
Just wondering, is the training input padded by right_context_length frames?
Ok. Our experiments on a streaming conformer trained with dynamic chunk size also show that increasing the right context does not consistently give improvements during decoding.
@bethant9 Hi, can you upload the WER files for the different right-context models? Or can you tell us the error types (substitutions, deletions, insertions) for the different models?
@bethant9 Can you try increasing the tail padding length in decode.py? https://github.com/k2-fsa/icefall/blob/6709bf1e6325166fcb989b1dbb03344d6b90b7f8/egs/librispeech/ASR/conv_emformer_transducer_stateless2/decode.py#L280 Maybe we need a larger tail padding length for a model with a larger right context length, to avoid losing the tail features.
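For anyone reading along, a minimal sketch of what enlarging that tail padding might look like. The variable names, the padding value, and the relationship between right_context_length and the subsampling factor are assumptions here, not the exact icefall code; check the linked line in decode.py for the real implementation:

```python
import math

import torch
import torch.nn.functional as F

# Hypothetical sketch: enlarge the tail padding so that the last chunk plus
# its right context is fully covered and no real frames are dropped.
# `feature` is (N, T, C) log-mel features; all names/values are assumptions.
N, T, C = 1, 500, 80
feature = torch.randn(N, T, C)
feature_lens = torch.tensor([T])

subsampling_factor = 4     # Conv2dSubsampling reduces the frame rate by 4
right_context_length = 32  # value from the failing experiment (assumed unit)
tail_pad = 30 + right_context_length * subsampling_factor  # larger than the default

# Pad the time axis on the right with a very small log-energy value.
feature = F.pad(feature, (0, 0, 0, tail_pad), value=math.log(1e-10))
feature_lens = feature_lens + tail_pad
```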
Hi, I found that I needed to pad the training data with right-context-length extra frames; otherwise, during training, right-context-length frames are removed from the input, which leads to incorrect training and higher WER.
If you don't pad during training, does the model with the longer right context show a high deletion error, especially at the end of the sentence?
Yes, exactly: the ends of utterances aren't correctly trained, so the deletion error is high.
I solved this by padding the input with right-context-length frames just before the Emformer.
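For anyone hitting the same issue, here is a minimal sketch of that kind of padding, applied to the encoder input right before the Emformer layers. The function name is hypothetical, and whether right_context_length is counted in raw or subsampled frames depends on where in the model you apply it; adapt accordingly:

```python
from typing import Tuple

import torch
import torch.nn.functional as F


def pad_right_context(
    x: torch.Tensor,
    x_lens: torch.Tensor,
    right_context_length: int,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Append right_context_length zero frames to the time axis so that the
    frames the Emformer strips off as right context are padding, not real speech.

    x:      (N, T, C) features just before the Emformer layers
    x_lens: (N,) valid lengths of x
    """
    # Pad the right side of the time dimension only.
    x = F.pad(x, (0, 0, 0, right_context_length))
    x_lens = x_lens + right_context_length
    return x, x_lens


# Usage sketch with dummy tensors.
x = torch.randn(2, 100, 512)
x_lens = torch.tensor([100, 80])
x, x_lens = pad_right_context(x, x_lens, right_context_length=32)
```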
Congratulations! Today I also found that the Emformer drops the last chunk, which may lead to deletion errors at the end. Thanks for your reply!
> Hi, I found that I needed to pad the training data with right-context-length extra frames; otherwise, during training, right-context-length frames are removed from the input, which leads to incorrect training and higher WER.
Great!
@bethant9 Hi, do you plan to open a PR to fix it?
No, but I'm happy for you to do that if you want.
OK, I will give it a try, thanks!