wav2letter Inference failed with long audio

Bug Description

When I run any type of inference (simple, multithreaded or interactive) on a long audio file (30 minutes), inference works fine for the first 15 minutes; after that the output is empty:

773000,774000,làm gì
774000,775000,
775000,776000,nhà tao biệt
776000,777000,lập ở xóm trại
777000,778000,đây
778000,779000,bao năm chẳng
779000,780000,làm ruộng nên trắng
780000,781000,biết làm gì
781000,782000,ngoài
782000,783000,làm cái gì cá ba
783000,784000,
784000,785000,xã
785000,786000,không bắt sao
786000,787000,
787000,788000,gì
788000,789000,nhưng bây
789000,790000,giờ dân ít
790000,791000,chơi rồi
791000,792000,chỉ thỉnh thoảng
792000,793000,vào dịch lễ
793000,794000,tết thôi
794000,795000,nên cũng
795000,796000,chẳng
796000,797000,
797000,798000,
798000,799000,
799000,800000,
800000,801000,
801000,802000,
802000,803000,
803000,804000,
804000,805000,
805000,806000,
806000,807000,
807000,808000,
808000,809000,
809000,810000,
810000,811000,
811000,812000,
812000,813000,
813000,814000,
814000,815000,
815000,816000,
816000,817000,
817000,818000,
818000,819000,
819000,820000,
820000,821000,
821000,822000,
822000,823000,
823000,824000,
824000,825000,
825000,826000,
826000,827000,
827000,828000,
828000,829000,
829000,830000,
830000,831000,
831000,832000,
832000,833000,
833000,834000,
834000,835000,
835000,836000,
836000,837000,
837000,838000,
838000,839000,
839000,840000,
840000,841000,
841000,842000,
842000,843000,
843000,844000,
844000,845000,

Does anyone have the same problem, and how can I fix it? Thank you.

Dec 16 '20 02:12 hieuhv94

@hieuhv94: I had the same problem a while ago and I am not quite sure how I fixed it (or whether I fixed it), but are you reading the file while it is being written to (although even this shouldn't cause any problems)?

Dec 16 '20 04:12 abhinavkulkarni

@abhinavkulkarni Thanks for your reply, but I am sure that I don't read the file while it is being written; I recorded it before decoding. If you remember how you fixed it, please tell me, or I'll try to fix it myself :))) Thanks!

Dec 16 '20 04:12 hieuhv94

@hieuhv94: Sorry, I meant the transcription file (rather than the audio file). But yeah, that's unlikely to be the cause behind the missing transcription.

Can you please verify that your audio is in either WAV or FLAC format, with a 16 kHz sample rate, 16-bit integer depth and a single (mono) channel?
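For WAV files, one way to run this check is with Python's standard wave module. This is only a minimal sketch: the function name check_wav_format and its default values are illustrative assumptions, not part of wav2letter, and it does not handle FLAC.

```python
import wave


def check_wav_format(path, rate=16000, sample_width_bytes=2, channels=1):
    """Return a list of mismatches between the WAV header and the expected format.

    Hypothetical helper: defaults correspond to 16 kHz, 16-bit, mono audio.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {rate} Hz"
            )
        if wav.getsampwidth() != sample_width_bytes:
            problems.append(
                f"sample width is {wav.getsampwidth() * 8} bits, "
                f"expected {sample_width_bytes * 8} bits"
            )
        if wav.getnchannels() != channels:
            problems.append(
                f"file has {wav.getnchannels()} channels, expected {channels}"
            )
    return problems
```

An empty list means the header matches the expected format; otherwise each entry names one mismatch.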

Thanks.

Dec 16 '20 04:12 abhinavkulkarni

Of course, the parameters of the wave file are:
Sample rate: 16 kHz
Bitrate: 256 kbps => 16-bit integer depth
Channels: mono
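The step from bitrate to bit depth can be checked with simple arithmetic, assuming uncompressed PCM audio (where bitrate = sample rate x bit depth x channels). The helper name below is hypothetical.

```python
def bits_per_sample(bitrate_bps, sample_rate_hz, channels):
    """Bit depth implied by the bitrate of uncompressed PCM audio."""
    return bitrate_bps // (sample_rate_hz * channels)


# 256 kbps at 16 kHz mono, the values reported above
depth = bits_per_sample(256_000, 16_000, 1)
```

With the values reported above, depth comes out as 16, which agrees with the 16-bit claim.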

Dec 16 '20 04:12 hieuhv94

And I print the output to the console, not to a transcript file.

Dec 16 '20 04:12 hieuhv94

cc @vineelpratap @xuqiantong

Dec 16 '20 07:12 tlikhomanenko

@vineelpratap, @xuqiantong Do you have any idea?

Dec 17 '20 04:12 hieuhv94

Hi all, is there any update on this issue? I have the same problem :(( I think the problem comes from the beam scores during decoding: when I tried lmweight=0 and wordscore=0, streaming worked normally with long audio, but when I set lmweight=0.7 and wordscore=0.8, streaming stopped producing any output after some chunk. Any idea?
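To illustrate the hypothesis, here is a simplified sketch of how a beam-search decoder of this kind typically combines scores. It is not the actual LexiconDecoder code: the function names are hypothetical, and the acoustic and LM log scores are made-up numbers chosen only to show how lmweight and wordscore can change which hypothesis wins.

```python
def hypothesis_score(am_score, lm_score, num_words, lmweight, wordscore):
    """Combined score: acoustic score + weighted LM score + per-word bonus."""
    return am_score + lmweight * lm_score + wordscore * num_words


def best_hypothesis(hyps, lmweight, wordscore):
    """Return the text of the highest-scoring (text, am, lm, num_words) tuple."""
    return max(
        hyps,
        key=lambda h: hypothesis_score(h[1], h[2], h[3], lmweight, wordscore),
    )[0]


# Made-up log scores: a 3-word hypothesis the LM dislikes, and an empty
# (silence-only) hypothesis with a worse acoustic score but no LM penalty.
hyps = [
    ("three word guess", -10.0, -20.0, 3),
    ("", -15.0, 0.0, 0),
]

without_lm = best_hypothesis(hyps, lmweight=0.0, wordscore=0.0)
with_lm = best_hypothesis(hyps, lmweight=0.7, wordscore=0.8)
```

In this toy setup the 3-word hypothesis wins when both weights are 0, while the empty hypothesis wins with lmweight=0.7 and wordscore=0.8, because the LM penalty outweighs the acoustic advantage.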

Dec 22 '20 03:12 mlexplore1122

Hi hieuhv94, can you confirm the same experiment with your audio?

Dec 22 '20 03:12 mlexplore1122

Hi @mlexplore1122, I tried the same experiment.

Hi @mlexplore1122, thank you for your idea. I tried this and the results were as you predicted: the model worked normally with lmweight=0 and wordscore=0. Did you fix it?

Dec 22 '20 09:12 hieuhv94

Sorry, but I am still debugging. The code flow in the lexicon decoder is not easy to understand, so I think I need more time. If any Facebook dev digs into their code with my hypothesis about the LM score in mind, I think it will be resolved faster. Anyway, I will post an update when I have any news. :)))

Dec 23 '20 02:12 mlexplore1122

Hi all, I just tested the offline decoder with the same LM that the decoder uses in streaming inference, and got the same result: I can't get the full output for the audio. (In this case the LM is just a 3-gram, pruned and quantized; this version is an attempt to shrink the LM for streaming.) When I change the LM to the 4-gram version with quantization only (the official version for the decoder), the output for the long audio is good and the whole audio is decoded. So I think that with a good enough LM we can avoid the problem (which comes from the logic in LexiconDecoder.cpp), as in issue #894, which is also still unresolved. So I will just wait for @vineelpratap's answer :((((, and I will keep trying.

Dec 24 '20 07:12 mlexplore1122

Hi, I think the problem could be that for very very long audios, we need to re-normalize the computed alphas (forward probabilities) . I'm looking into the best way to fix it. Will get back soon...
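As a generic illustration of why forward probabilities need re-normalizing on long inputs, here is a minimal sketch of the textbook forward recursion in the probability domain, with optional per-frame rescaling. It is not wav2letter code, and the decoder may store its scores differently; the function name and the toy model in the usage note are assumptions.

```python
import math


def forward(init, trans, frames, renormalize):
    """Forward recursion over per-frame emission likelihoods.

    init[j]      : initial probability of state j
    trans[i][j]  : transition probability from state i to state j
    frames[t][j] : emission likelihood of frame t in state j
    Returns (alpha at the last frame, accumulated log of the scaling factors).
    Without renormalization the second value stays 0.0.
    """
    n = len(init)
    alpha = [init[j] * frames[0][j] for j in range(n)]
    log_norm = 0.0
    for t in range(len(frames)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * frames[t][j]
                for j in range(n)
            ]
        if renormalize:
            # Rescale so alpha sums to 1 and keep the scale in log space.
            total = sum(alpha)
            alpha = [a / total for a in alpha]
            log_norm += math.log(total)
    return alpha, log_norm
```

For example, with two states and 2000 frames that each have an emission likelihood of 0.1, the unnormalized alphas underflow to exactly 0.0 in double precision, whereas the rescaled alphas still sum to 1 and the total log-likelihood is recovered from the accumulated log scaling factors.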

Jan 05 '21 00:01 vineelpratap

I cannot reproduce the issue.

I took a librispeech audio and replicated it 100 times to create a ~30 minute audio and used simple_streaming_asr_example and it transcribed everything correctly...

This is what I did...

> cd data 
> for f in acoustic_model.bin tds_streaming.arch decoder_options.json feature_extractor.bin language_model.bin lexicon.txt tokens.txt ; do wget http://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/${f} ; done
> # take any audio.flac file from LibriSpeech
> sox audio.flac audio.wav  # convert to .wav
> cp audio.wav longaudio.wav && for i in {1..100};do sox audio.wav longaudio.wav longaudio.wav; done
> ./$PATH/simple_streaming_asr_example -input_files_base_path data -input_audio_file longaudio.wav

If you can give a way for me to reproduce the issue, it will help me in debugging...

Jan 05 '21 09:01 vineelpratap

Hi @vineelpratap, as mentioned in some comments above, the problem only occurs when the language model is not good enough, meaning it does not fit the audio domain. Following this idea, I used a 3-gram LM pruned with prune 0 5 6 in KenLM. I tried to find audio with a large domain mismatch, but perhaps the acoustic model, trained with unsupervised data, is so good that it was hard to find audio that reproduces the error easily. I finally found this audio, from the domain of gaming news, that reproduces the error (it is 1 hour long and the error occurs only from the 42nd to the 51st minute, not over the whole audio), as in some experiments in my language. And of course, when I cut out minutes 42-51 and decode that segment alone, everything works normally. I have included my cut audio so you can reproduce this. Thank you. All resources for reproduction are in my drive: drive folder with the audio and lm_3_gram_prune_056 to reproduce the error
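A cut like the one described above (minutes 42 to 51) can be sketched with Python's standard wave module, assuming the audio is an uncompressed PCM WAV file. The function name cut_wav and the file names in the usage note are hypothetical.

```python
import wave


def cut_wav(src_path, dst_path, start_s, end_s):
    """Copy the segment [start_s, end_s) in seconds from src_path to dst_path."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        start_frame = min(int(start_s * rate), src.getnframes())
        n_frames = int((end_s - start_s) * rate)
        src.setpos(start_frame)
        # readframes returns fewer frames if the file ends early
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate; the frame count in the
        # header is corrected by the wave module when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```

For instance, cut_wav("full.wav", "cut_42_51.wav", 42 * 60, 51 * 60) would write the 9-minute segment to a new file that can then be decoded on its own.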

Jan 07 '21 08:01 trangtv57