
How to use a pre-trained model for a cache-aware FastConformer-Hybrid model?

sangeet2020 opened this issue 1 year ago • 3 comments

Hi @titu1994,

Following our discussion in this thread, I’m training a cache-aware FastConformer hybrid CTC-RNNT model for German using 1.2K hours of audio data. Despite training for 150 epochs, my validation WER is still around 0.28.

I suspect the dataset quality might be an issue. I reviewed the paper "Stateful Conformer with Cache-Based Inference for Streaming ASR" and noted the significant performance achieved even with training from scratch on LibriSpeech.

Since you recommended using a pre-trained model, I tried using this model from Hugging Face, but it's not a streaming model. Is it still viable as a pre-trained model for my use case, or are there other German models available that you would recommend?

Thank you for your guidance!

sangeet2020 avatar Jun 19 '24 08:06 sangeet2020

You can use this model, which is a chunk-aware model - https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi

titu1994 avatar Jun 19 '24 18:06 titu1994
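For reference, a minimal sketch of loading the suggested checkpoint with NeMo and running a quick offline sanity check (this snippet is not from the thread; it assumes `nemo_toolkit[asr]` is installed and that `sample.wav` is a hypothetical 16 kHz mono WAV file):

```python
# Minimal sketch: load the cache-aware streaming FastConformer hybrid checkpoint
# suggested above and transcribe one file offline as a sanity check.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "stt_en_fastconformer_hybrid_large_streaming_multi"
)

# "sample.wav" is a placeholder path; transcribe() takes a list of audio files.
print(asr_model.transcribe(["sample.wav"]))
```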

Thank you @titu1994. I will try it. But this model has been trained on a labeled English dataset. I want to understand the logic: how would it adapt to another language?

sangeet2020 avatar Jun 21 '24 12:06 sangeet2020

It's a practical limitation. You can either start from an ordinary FastConformer in German or a chunk-aware FastConformer in English. It depends on what your priority is - streaming or transcription accuracy. We have a tutorial showing cross-language transfer; a rough sketch of the idea is below.

titu1994 avatar Jun 21 '24 16:06 titu1994
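A minimal sketch of the cross-language transfer idea referenced above, assuming a German BPE tokenizer has already been built (e.g. with NeMo's tokenizer preparation scripts). The paths, batch size, and epoch count are illustrative placeholders, not values from the thread:

```python
# Sketch: adapt the English streaming checkpoint to German by swapping the
# tokenizer and fine-tuning on German data. Paths below are hypothetical.
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf
import pytorch_lightning as pl

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "stt_en_fastconformer_hybrid_large_streaming_multi"
)

# Replace the English BPE vocabulary with a German one; encoder weights are kept.
asr_model.change_vocabulary(
    new_tokenizer_dir="tokenizers/de_bpe_1024",  # hypothetical tokenizer directory
    new_tokenizer_type="bpe",
)

# Point the model at German manifests and fine-tune.
train_cfg = OmegaConf.create({
    "manifest_filepath": "manifests/train_de.json",  # hypothetical manifest
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_training_data(train_cfg)

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=50)
trainer.fit(asr_model)
```

In practice you would also set up validation data and tune the learning rate schedule; the NeMo cross-language fine-tuning tutorial mentioned above covers those details.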