icefall
Add convrnnt.py
Add a ConvRNN-T encoder, from "ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition" (https://arxiv.org/pdf/2209.14868.pdf).
Model size: 44M parameters.
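For readers unfamiliar with the architecture, below is a minimal sketch of what a ConvRNN-T-style encoder block can look like: a causal depthwise convolution interleaved with a unidirectional LSTM, which is the general idea of augmenting a recurrent transducer encoder with convolution for streaming ASR. This is not the actual convrnnt.py added in this PR; all module names, dimensions, and hyperparameters here are illustrative assumptions.

```python
# Illustrative sketch only -- not the convrnnt.py from this PR.
import torch
import torch.nn as nn


class ConvRNNTLayer(nn.Module):
    """One conv + LSTM block operating on (N, T, C) features."""

    def __init__(self, d_model: int = 512, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        # Causal left padding so the convolution sees no future frames,
        # keeping the layer usable for streaming.
        self.pad = nn.ConstantPad1d((kernel_size - 1, 0), 0.0)
        # Depthwise 1-D convolution over time captures local context.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.norm = nn.LayerNorm(d_model)
        # Unidirectional LSTM models long-range left context.
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C)
        residual = x
        y = self.conv(self.pad(x.transpose(1, 2))).transpose(1, 2)  # (N, T, C)
        y = self.norm(residual + self.dropout(y))
        y, _ = self.lstm(y)
        return residual + self.dropout(y)


class ConvRNNTEncoder(nn.Module):
    """Stack of ConvRNNTLayer blocks on top of a simple input projection."""

    def __init__(self, num_features: int = 80, d_model: int = 512, num_layers: int = 6):
        super().__init__()
        self.input_proj = nn.Linear(num_features, d_model)
        self.layers = nn.ModuleList(ConvRNNTLayer(d_model) for _ in range(num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, num_features) filter-bank features.
        y = self.input_proj(x)
        for layer in self.layers:
            y = layer(y)
        return y  # (N, T, d_model), consumed by the transducer joiner/decoder.


if __name__ == "__main__":
    feats = torch.randn(2, 100, 80)  # (batch, frames, mel bins)
    out = ConvRNNTEncoder()(feats)
    print(out.shape)  # torch.Size([2, 100, 512])
```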
The best WER on LibriSpeech 960h within 20 epochs (epoch-20, avg-4, modified_beam_search, beam-size-4, use-averaged-model) is:

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 5.01       | 11.92      |
| Model      | Clean | Other | Size (M) |
|------------|-------|-------|----------|
| RNN-T      | 5.9   | 15.71 | 30       |
| Conformer  | 5.7   | 14.24 | 29       |
| ContextNet | 6.02  | 14.42 | 28       |
| ConvRNN-T  | 5.11  | 13.82 | 29       |
The WERs reported in the paper seem a lot worse than those in the original Conformer/ContextNet papers. Any idea why that is?
I can't reproduce Google's setup, so I have no idea.