New dataset preparing

Open WangHelin1997 opened this issue 3 years ago • 1 comments

Hi authors,

Amazing work! Are there any scripts to prepare a new dataset that I can train or test on? Or, to simplify: how can I measure the WER on a new dataset using the model pretrained on Librispeech? Thanks so much.

WangHelin1997, Sep 30 '22 17:09

We are using lhotse for data preparation. In order to test the WER of your own test dataset, you have to use lhotse to prepare your data.
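For reference, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The icefall decode scripts compute it for you; the sketch below only illustrates the metric. The helper names `edit_distance` and `wer` are made up for this example and are not icefall functions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (sub/ins/del all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def wer(pairs):
    """Corpus-level WER over (reference, hypothesis) transcript pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words
```

Note that errors and word counts are summed over the whole test set before dividing, so long utterances weigh more than short ones.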

If you already have a Kaldi data dir, it is fairly easy to convert it to lhotse format by following https://lhotse.readthedocs.io/en/latest/kaldi.html#cli. One thing to note: you have to re-extract the features using lhotse if you want to use a pre-trained model from icefall. Models in icefall are trained on features extracted from normalized audio samples, i.e., samples in the range [-1, 1], while Kaldi uses un-normalized audio samples.
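To make the scale difference concrete, here is a minimal sketch assuming 16-bit PCM audio: un-normalized samples keep their integer range of [-32768, 32767], while normalized samples are divided by 32768. The function name `int16_to_float` is hypothetical; in practice lhotse handles this when it loads the audio.

```python
def int16_to_float(samples):
    """Scale 16-bit PCM sample values to floats in [-1, 1)."""
    return [s / 32768.0 for s in samples]


raw = [-32768, 0, 16384, 32767]   # un-normalized 16-bit values
normalized = int16_to_float(raw)  # all values now lie within [-1, 1)
```

Features computed at the two scales differ, which is why features extracted by Kaldi should not be fed to a model trained on lhotse features.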

If you don't have a kaldi data dir, then you can follow any recipe in https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes to add your own dataset. After that, please have a look at https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare.sh for how to use your own dataset.

Once you can generate a cuts.jsonl.gz from your own dataset, you can follow https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/tdnn_lstm_ctc/asr_datamodule.py to write a dataloader for your dataset.
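A quick sanity check of the generated manifest can be sketched with the standard library alone, assuming the file is gzipped JSON Lines with one cut per line and that each cut has a `duration` field in seconds. The helpers `iter_cuts` and `summarize` are made up for this example; in a real pipeline you would load the file with lhotse instead.

```python
import gzip
import json


def iter_cuts(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines manifest."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path):
    """Return (number of cuts, total duration in seconds)."""
    count = 0
    total = 0.0
    for cut in iter_cuts(path):
        count += 1
        total += cut.get("duration", 0.0)  # assumed field name
    return count, total
```

If the count or total duration does not match what you expect for your test set, the data preparation step is worth revisiting before decoding.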

To decode your dataset with a pretrained model, please have a look at https://github.com/k2-fsa/icefall/blob/f3ad32777a598de6169274198e418ee95fbf6ddc/egs/librispeech/ASR/pruned_transducer_stateless2/decode.py#L786-L793

csukuangfj, Oct 01 '22 01:10