Trained checkpoint produces empty result during runtime decoding
I have a wenet model trained on LibriSpeech at 8 kHz that produces decent output when I load my saved PyTorch checkpoint and run offline decoding with the ctc_greedy_search() method. I've exported that saved checkpoint to jit using the provided script and am now trying to follow the websocket online decoding demo. I successfully built and ran both the offline demo and the command-line streaming server-client demo on a sample file (LibriSpeech/test-other/3080/5032/3080-5032-0016.wav), but both demos come back with empty final decoding results.
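For context on what "empty result" means here: ctc_greedy_search() follows the standard CTC greedy rule (argmax per frame, collapse consecutive repeats, drop blanks), so if every frame's argmax is the blank token the final hypothesis is empty. A minimal sketch of that rule in pure Python (not the wenet implementation; blank index 0 is an assumption):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Standard CTC greedy collapse: merge consecutive repeats, then drop blanks.

    `frame_ids` is the per-frame argmax over the model's output tokens.
    If every frame is `blank`, the returned hypothesis is empty.
    """
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```

So an all-blank posterior (which the maintainer suspects below) would produce exactly the empty-result symptom described here.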
I ran the offline demo with:

```
./build/decoder_main --chunk_size -1 \
  --wav_path $wav_path \
  --model_path $model_dir/checkpoint.jit \
  --dict_path $model_dir/words.txt \
  --sample_rate 8000 2>&1 | tee log.txt
```

I've attached my config file and my dictionary file. Any thoughts on what's happening here?
Have you updated the code to the latest version? There was a bug before that we just fixed; please try the latest code.
Yeah, I'm even with origin/main at 017af3a (pulled and rebuilt). I'm not seeing any partial results either.
What does the output look like when you run the offline demo? Please attach the log.txt.
Here's my log! offline_output.txt
Does the model work well on LibriSpeech/test-other/3080/5032/3080-5032-0016.wav before export?
It looks like all frames are predicted as `<blank>`.

Two checks that may help:

- Check whether the dict `words.txt` is the same between training and runtime.
- Check whether the `dither` param is the same in `fbank.h` (8 kHz audio may be sensitive to the dither param).
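To make the first check concrete, here is a small sketch that parses two `words.txt` files (assuming the usual wenet format of one `<symbol> <index>` pair per line) and compares them; `load_dict` and `dicts_match` are hypothetical helper names, not part of wenet:

```python
def load_dict(path):
    """Parse a words.txt-style file: one '<symbol> <index>' pair per line."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                sym, idx = parts
                table[sym] = int(idx)
    return table


def dicts_match(train_path, test_path):
    """True if both dicts are identical and <blank> sits at index 0."""
    a, b = load_dict(train_path), load_dict(test_path)
    return a == b and a.get("<blank>") == 0
```

A mismatch (or `<blank>` not at index 0 at runtime) would make the greedy search drop or misread every frame.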
The model works well both before and after export on the sample file; here's the output from my Python reimplementation of the offline demo using the same exported checkpoint: offline_output_python_reimpl.txt. (Please ignore the timings; I had a training job running on the same machine.)
words.txt does match between train and test, with <blank> at index 0 in both cases.
I believe the dither param also matches. I have it set to 0.0 for training, and that looks like the default in fbank.h as well.
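For anyone hitting this later: dither is small random noise added to the waveform before feature extraction (Kaldi-style frontends typically add Gaussian noise scaled by the dither value), so a train/runtime mismatch changes the features the model sees. A rough sketch of the idea, not wenet's actual `fbank.h` code:

```python
import random


def apply_dither(samples, dither):
    """Add Gaussian noise scaled by `dither` to each sample, a sketch of what
    Kaldi-style frontends do before computing fbank features.

    dither=0.0 is a no-op, which is why the value must match between
    training and runtime feature extraction.
    """
    if dither == 0.0:
        return list(samples)
    return [s + dither * random.gauss(0.0, 1.0) for s in samples]
```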
Is there anything else around frontend feature extraction that you could see leading to an all-blank output?
Maybe some audio files have 2 channels; you can refer to https://github.com/wenet-e2e/wenet/issues/519
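A quick way to check the channel count (and sample rate) of a wav file using only the Python standard library; `wav_info` is a hypothetical helper name:

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate, sample_width_bytes) for a WAV file.

    Decoders that assume mono input can silently misbehave on 2-channel
    audio, so channels should be 1 here.
    """
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getframerate(), w.getsampwidth()
```

If the file is not mono, downmix it before decoding, e.g. `sox in.wav -c 1 out.wav`.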