icefall
New dataset preparing
Hi authors,
Amazing work! Are there any scripts for preparing a new dataset that I can train or test on? Or, to simplify: how can I measure the WER on a new dataset using the model pretrained on LibriSpeech? Thanks so much.
We are using lhotse for data preparation. In order to test the WER of your own test dataset, you have to use lhotse to prepare your data.
If you already have a Kaldi data dir, it is fairly easy to convert it to lhotse format by following https://lhotse.readthedocs.io/en/latest/kaldi.html#cli. One thing to note: you have to re-extract the features using lhotse if you want to use a pre-trained model from icefall, since models in icefall are trained on features extracted from normalized audio samples, i.e., samples in the range [-1, 1], while Kaldi uses un-normalized audio samples.
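To make the normalization point concrete, here is a minimal sketch of the scaling difference, assuming 16-bit PCM input. The helper name `normalize_int16` is hypothetical and is not part of lhotse or Kaldi; it only illustrates why features computed on raw integer-valued samples do not match features computed on samples in [-1, 1].

```python
def normalize_int16(samples):
    """Scale 16-bit PCM integer samples to floats in [-1.0, 1.0).

    Assumption: the audio is 16-bit PCM, so the full-scale magnitude is 32768.
    """
    return [s / 32768.0 for s in samples]


# Un-normalized samples: integer-valued, as read straight from a 16-bit wav.
raw = [-32768, -16384, 0, 16384, 32767]

# Normalized samples: the range that icefall models expect at feature extraction.
normalized = normalize_int16(raw)
print(normalized)
```

Because the two conventions differ by a constant factor, log-energy-based features end up offset from each other, which is why features extracted by Kaldi cannot simply be reused with a model trained in icefall.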
If you don't have a kaldi data dir, then you can follow any recipe in https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes to add your own dataset. After that, please have a look at https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare.sh for how to use your own dataset.
Once you are able to generate a cuts.jsonl.gz from your own dataset, you can follow
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py
to write a dataloader for your dataset.
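Before writing the dataloader, it can help to sanity-check the generated manifest. The sketch below uses only the Python standard library and assumes that cuts.jsonl.gz is a gzip-compressed file with one JSON object per line, and that each cut carries a `duration` field and a `supervisions` list whose entries have a `text` field. Those field names are an assumption about the manifest layout; check them against your own file. `summarize_cuts` is a hypothetical helper, not a lhotse function.

```python
import gzip
import json


def summarize_cuts(path):
    """Return (number of cuts, total duration in seconds, cuts without a transcript).

    Assumed layout: one JSON object per line, with "duration" and
    "supervisions" (a list of dicts that may contain "text").
    """
    num_cuts = 0
    total_duration = 0.0
    missing_text = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            cut = json.loads(line)
            num_cuts += 1
            total_duration += float(cut.get("duration", 0.0))
            # Treat a missing, null, or whitespace-only transcript as "no text".
            texts = [(s.get("text") or "") for s in cut.get("supervisions", [])]
            if not any(t.strip() for t in texts):
                missing_text += 1
    return num_cuts, total_duration, missing_text


# Example usage (the path is a placeholder):
# print(summarize_cuts("cuts.jsonl.gz"))
```

A non-zero count of cuts without a transcript usually means the supervisions were not attached correctly during preparation, and those cuts cannot contribute to a WER computation.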
To decode your dataset with a pretrained model, please have a look at https://github.com/k2-fsa/icefall/blob/f3ad32777a598de6169274198e418ee95fbf6ddc/egs/librispeech/ASR/pruned_transducer_stateless2/decode.py#L786-L793
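For reference, WER itself is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The following is a minimal, self-contained sketch of that definition. It is not icefall's scoring code, and `word_error_rate` is a hypothetical helper; it assumes whitespace tokenization and that any text normalization (casing, punctuation) has already been applied to both strings.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp (standard Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat", "the bat sat"))
```

For a whole test set, WER is normally computed by summing the errors and the reference word counts over all utterances before dividing, rather than by averaging per-utterance values.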