Trained checkpoint produces empty result during runtime decoding
I have a wenet model trained on LibriSpeech at 8 kHz that produces decent output when I load my saved PyTorch checkpoint and run offline decoding with the ctc_greedy_search() method. I've exported that saved checkpoint to jit using the provided script and am now trying to follow the websocket online decoding demo. I successfully built and ran both the offline demo and the command-line streaming server-client demo on a sample file (LibriSpeech/test-other/3080/5032/3080-5032-0016.wav), but both demos come back with empty final decoding results.
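For context on what "empty result" means here: ctc_greedy_search() follows the standard CTC greedy rule (argmax per frame, collapse consecutive repeats, drop blanks), so if every frame's argmax is the blank token the final hypothesis is empty. A minimal sketch of that rule in pure Python (not the wenet implementation; blank index 0 is an assumption):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Standard CTC greedy collapse: merge consecutive repeats, then drop blanks.

    `frame_ids` is the per-frame argmax over the model's output tokens.
    If every frame is `blank`, the returned hypothesis is empty.
    """
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```

So an all-blank posterior (which the maintainer suspects below) would produce exactly the empty-result symptom described here.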
I ran the offline demo with:

```
./build/decoder_main --chunk_size -1 \
  --wav_path $wav_path \
  --model_path $model_dir/checkpoint.jit \
  --dict_path $model_dir/words.txt \
  --sample_rate 8000 2>&1 | tee log.txt
```

I've attached my config file and my dictionary file. Any thoughts on what's happening here?
Have you updated the code to the latest version? There was a bug before that we just fixed; please try the latest code.
Yeah, I'm even with origin/main at 017af3a (pulled and rebuilt). I'm not seeing any partial results either.
What does the output look like when you run the offline demo? Please attach the log.txt.
Here's my log! offline_output.txt
Does the model work well on LibriSpeech/test-other/3080/5032/3080-5032-0016.wav before export?
It looks like all frames are predicted as `<blank>`.

Two checks that may help:

- Check whether the dict `words.txt` is the same between training and runtime.
- Check whether the `dither` param is the same in `fbank.h` (8 kHz audio may be sensitive to the dither param).
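To make the first check concrete, here is a small sketch that parses two `words.txt` files (assuming the usual wenet format of one `<symbol> <index>` pair per line) and compares them; `load_dict` and `dicts_match` are hypothetical helper names, not part of wenet:

```python
def load_dict(path):
    """Parse a words.txt-style file: one '<symbol> <index>' pair per line."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                sym, idx = parts
                table[sym] = int(idx)
    return table


def dicts_match(train_path, test_path):
    """True if both dicts are identical and <blank> sits at index 0."""
    a, b = load_dict(train_path), load_dict(test_path)
    return a == b and a.get("<blank>") == 0
```

A mismatch (or `<blank>` not at index 0 at runtime) would make the greedy search drop or misread every frame.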
The model works well both before and after export on the sample file; here's the output from my Python reimplementation of the offline demo using the same exported checkpoint: offline_output_python_reimpl.txt. (Please ignore the timings; I had a training job running on the same machine.)
words.txt does match between train and test, with <blank> at index 0 in both cases.
I believe the dither param also matches. I have it set to 0.0 for training, and that looks like the default in fbank.h as well.
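For anyone hitting this later: dither is small random noise added to the waveform before feature extraction (Kaldi-style frontends typically add Gaussian noise scaled by the dither value), so a train/runtime mismatch changes the features the model sees. A rough sketch of the idea, not wenet's actual `fbank.h` code:

```python
import random


def apply_dither(samples, dither):
    """Add Gaussian noise scaled by `dither` to each sample, a sketch of what
    Kaldi-style frontends do before computing fbank features.

    dither=0.0 is a no-op, which is why the value must match between
    training and runtime feature extraction.
    """
    if dither == 0.0:
        return list(samples)
    return [s + dither * random.gauss(0.0, 1.0) for s in samples]
```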
Is there anything else around frontend feature extraction that you could see leading to an all-blank output?
Maybe some audio files have 2 channels; you can refer to https://github.com/wenet-e2e/wenet/issues/519
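A quick way to check the channel count (and sample rate) of a wav file using only the Python standard library; `wav_info` is a hypothetical helper name:

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate, sample_width_bytes) for a WAV file.

    Decoders that assume mono input can silently misbehave on 2-channel
    audio, so channels should be 1 here.
    """
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getframerate(), w.getsampwidth()
```

If the file is not mono, downmix it before decoding, e.g. `sox in.wav -c 1 out.wav`.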