wav2letter
Higher Error Rate while Decoding
I have completed training for 60 epochs on our own English dataset. We followed all the rules correctly, added the data, obtained the best acoustic model, and have started decoding. Below is the stat of epoch 60 (which is the best epoch):
epoch: 60 | lr: 0.025000 | lrcriterion: 0.025000 | runtime: 06:48:52 | bch(ms): 240.14 | smp(ms): 2.13 | fwd(ms): 14.87 | crit-fwd(ms): 1.08 | bwd(ms): 215.05 | optim(ms): 7.81 | loss: 32.27826 | train-LER: 30.85 | train-WER: 46.98 | lists/dev.lst-loss: 16.22196 | lists/dev.lst-LER: 21.58 | lists/dev.lst-WER: 33.94 | avg-isz: 1003 | avg-tsz: 018 | max-tsz: 130 | hrs: 4556.04 | thrpt(sec/sec): 668.56
When I start decoding, the predicted value is way different from the true value, as you can see below:
root@1d325108be6b:/home# /root/wav2letter/build/Decoder --flagsfile decoder.cfg | tee logfile.txt
|T|: i wish to express my gratitude to kathy davis for her immense practical help
|P|: left then best new new it best best eighteen eighteen eighteen eighteen mother woman woman woman hard hard hard hard hard hard hard hard hard hard hard hard hard hard hard hard hard hard it it it it it it it it it it it hard hard wanted woman company game says says it it hard wanted wanted wanted company company company company says says done it it it woman wanted wanted cant cant cant company it hard hard hard hard hard
[sample: 00004150.wav, WER: 585.714%, LER: 709.211%, slice WER: 489.419%, slice LER: 590.291%, progress (slice 2): 27.0965%]
|T|: what is the humidity in dharmavaram
|P|: best best new new best best eighteen eighteen eighteen nor mother mother mother company woman woman woman woman company newness it it woman wanted wanted company company company says done it it it wanted cant cant company company it hard hard hard hard hard
[sample: 353.wav, WER: 733.333%, LER: 1160%, slice WER: 496.585%, slice LER: 603.658%, progress (slice 0): 27.3784%]
|T|: when given a list of music to prepare at his first meeting with sanger he didnt realise that it was a terms work and
|P|: left then best new few it best best best eighteen eighteen eighteen eighteen eighteen eighteen woman hard hard hard hard hard hard hard hard hard hard hard hard hard hard it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it hard hard wanted wanted woman company company game newness it it it hard wanted wanted wanted company cant company company says it it it it it hard wanted cant cant company company it hard hard hard hard hard
[sample: sample-014189.wav, WER: 408.333%, LER: 470.69%, slice WER: 496.52%, slice LER: 607.905%, progress (slice 1): 26.9086%]
|T|: waters philosophical lyrics rolling stone described pink floyd as purveyors of a distinctively dark vision
|P|: left then best new it best best best eighteen eighteen eighteen eighteen eighteen eighteen mother mother woman hard hard hard hard hard hard hard hard hard hard hard hard wanted woman company company says says it it it it hard woman wanted wanted wanted woman company company says done it it it wanted wanted cant cant cant company company it hard hard hard hard hard
[sample: sample-102577.wav, WER: 433.333%, LER: 450.943%, slice WER: 497.908%, slice LER: 603.061%, progress (slice 3): 26.9%]
|T|: and i must needs say that thou wilt not be called so mighty a man here as thou art at home if thou showest no greater prowess in other feats than methinks will be shown in this thor full of wrath again set
the horn to his lips and did his best to empty it but on looking in found the liquor was only a little lower
|P|: for for few best eighteen eighteen eighteen eighteen eighteen eighteen eighteen eighteen hard hard hard hard hard hard hard hard hard hard hard hard it it it it it it it it it it it it it it it it it it it
it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it it hard hard hard woman cant cant
company company company company it it it it it hard woman wanted wanted wanted cant cant cant company company hard it it it hard wanted wanted cant cant cant cant company it hard hard hard hard hard
[sample: 826-131124-0029.wav, WER: 216.418%, LER: 230.255%, slice WER: 488.885%, slice LER: 589.66%, progress (slice 2): 27.1083%]
The following is the decoder configuration file decoder.cfg:
--am=/home/training/english_train/015_model_lists#dev.lst.bin
--maxload=-1
--test=lists/dev.lst
--nthread_decoder=4
--nthread_decoder_am_forward=2
--decodertype=wrd
--uselexicon=true
--tokensdir=/home/am/
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/home/am/librispeech-train+dev-unigram-10000-nbest10.lexicon
--show=true
The dev.lst has 34,055 files and the predictions remain nearly the same for all of them. During training the LER and WER are around 30 (as shown above), but the decoding LER and WER are more than 200%. I wanted to know why this is happening and what I can do to mitigate it.
Are you using an s2s model or CTC?
Here are the docs on decoder parameters (which ones you need to tune in which cases): https://github.com/facebookresearch/wav2letter/wiki/Beam-Search-Decoder. It seems you are using s2s, so try setting lmweight (from 0 to 3), eosscore (from -10 to 0), attentionthreshold=30, and the lm itself.
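For example, a decoder.cfg along these lines (the paths below are placeholders for your own files, and the lmweight/eosscore numbers are just starting points inside those ranges, not tuned values):
--am=/path/to/am.bin
--test=lists/dev.lst
--tokensdir=/path/to/tokens/dir
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/path/to/lexicon
--lm=/path/to/lm.bin
--lmtype=kenlm
--lmweight=1.5
--eosscore=-5
--attentionthreshold=30
--uselexicon=true
--decodertype=wrd
--show=true
You would then sweep lmweight and eosscore over those ranges and keep the pair that gives the lowest dev WER.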
@tlikhomanenko Yes, we are using an s2s model with only the AM; we are not using any LM in the current scenario, so why set lmweight? And can you also please verify whether the configuration given in the above decoder.cfg is correct or not.
@GowthamS07,
As I understand it, you want something more than just the Viterbi path. Do you want zero-LM decoding where you still restrict to the lexicon, or do you want the same decoding (Viterbi path) as is computed during training?
Hi @tlikhomanenko. Gowtham and I are working on the same team, and I have the same question he asked above. Let me explain our doubts clearly.
I am using the TDS_s2s architecture and trained my own English dataset from scratch for 60 epochs. Training seems to have converged, since the dev loss has been converging over the last few epochs.
epoch: 60 | lr: 0.025000 | lrcriterion: 0.025000 | runtime: 06:48:52 | bch(ms): 240.14 | smp(ms): 2.13 | fwd(ms): 14.87 | crit-fwd(ms): 1.08 | bwd(ms): 215.05 | optim(ms): 7.81 | loss: 32.27826 | train-LER: 30.85 | train-WER: 46.98 | lists/dev.lst-loss: 16.22196 | lists/dev.lst-LER: 21.58 | lists/dev.lst-WER: 33.94 | avg-isz: 1003 | avg-tsz: 018 | max-tsz: 130 | hrs: 4556.04 | thrpt(sec/sec): 668.56
After the AM is trained successfully, I am using three different approaches:
Approach#1: Greedy path (no LM part is involved). The corresponding configurations/parameters used:
--am=/home/training1/english_train/011_model_lists#dev.lst.bin
--tokensdir=/home/english_data/am1
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/home/english_data/am1/librispeech-train+dev-unigram-10000-nbest10.lexicon
--datadir=/home/english_data/
--test=lists1/dev.lst
--sclite=/home/sclite1
--uselexicon=true
--decodertype=wrd
--beamsize=50
--beamsizetoken=10
--beamthreshold=10
--attentionthreshold=30
--smoothingtemperature=1
--nthread_decoder=1
--show
--eosscore=-8.4672579451673
--maxload=2
The command I used: /root/wav2letter/build/Test --flagsfile /home/test_am_tds_s2s_.cfg
Output:
|T|: s h e _ a l w a y s _ t u r n s _ t o _ h i m _ w h e n _ s h e _ i s _ i n _ r e a l _ t r o u b l e
|P|:
[sample: 279.wav, WER: 100%, LER: 100%, total WER: 100%, total LER: 100%, progress (thread 0): 50%]
|T|: n o n e _ o f _ t h i s _ h a _ d _ m a d e _ a n _ i m p r e s s i o n _ o n _ t h e _ b o y
|P|: t h e
[sample: sample-003828.wav, WER: 90.9091%, LER: 93.617%, total WER: 95.4545%, total LER: 96.9388%, progress (thread 0): 100%]
Now I am using the same parameters but changing the binary to Decoder instead of Test (every flag remains the same).
The command I used: /root/wav2letter/build/Decoder --flagsfile /home/test_am_tds_s2s_.cfg
Output:
|T|: she always turns to him when she is in real trouble
|P|: before this was always the first time last year before six thousand before six year before six thousand seven hundred thousand nine hundred ninety nine thousand nine hundred eight thousand nine hundred nine hundred eight thousand eight hundred nine hundred eighties in one thousand seven hundred and eight hundred eight hundred thousand
[sample: 279.wav, WER: 454.545%, LER: 574.51%, slice WER: 454.545%, slice LER: 574.51%, decoded samples (thread 0): 1]
|T|: none of this ha d made an impression on the boy
|P|: it didnt even occur to the boy to see that the boy was able to understand what he would
[sample: sample-003828.wav, WER: 154.545%, LER: 131.915%, slice WER: 304.545%, slice LER: 362.245%, decoded samples (thread 0): 2]
Question#1: What is the difference between the Decoder and Test commands while keeping every parameter the same in the config file? (using a greedy path for both)
Approach#2: Beam-search decoding (with zeroLM). The corresponding configurations/parameters used:
--am=/home/training1/english_train/012_model_lists#dev.lst.bin
--tokensdir=/home/english_data/am1
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/home/english_data/am1/librispeech-train+dev-unigram-10000-nbest10.lexicon
--lm=
--datadir=/home/english_data/
--test=lists1/dev.lst
--sclite=/home/sclite1
--uselexicon=true
--decodertype=wrd
--beamsize=50
--beamsizetoken=10
--beamthreshold=10
--attentionthreshold=30
--smoothingtemperature=1
--nthread_decoder=1
--show
#--showletters
#--lmtype=kenlm
#--lmweight=0.94468978683208
--eosscore=-8.4672579451673
--maxload=2
I am leaving the lm flag empty as per your suggestion in https://github.com/facebookresearch/wav2letter/issues/750, but it produces the following error:
terminate called after throwing an instance of 'util::ErrnoException'
what(): /root/kenlm/util/file.cc:76 in int util::OpenReadOrThrow(const char*) threw ErrnoException because `-1 == (ret = open(name, 00))'.
No such file or directory while opening
*** Aborted at 1598968412 (unix time) try "date -d @1598968412" if you are using GNU date ***
PC: @ 0x7f0726c1be97 gsignal
*** SIGABRT (@0xdc5) received by PID 3525 (TID 0x7f076c4eb380) from PID 3525; stack trace: ***
@ 0x7f0764801890 (unknown)
@ 0x7f0726c1be97 gsignal
@ 0x7f0726c1d801 abort
@ 0x7f0727610957 (unknown)
@ 0x7f0727616ab6 (unknown)
@ 0x7f0727616af1 std::terminate()
@ 0x7f0727616d24 __cxa_throw
@ 0x55ed151ce143 util::OpenReadOrThrow()
@ 0x55ed151ccd8e lm::ngram::RecognizeBinary()
@ 0x55ed1518e02c lm::ngram::LoadVirtual()
@ 0x55ed150b2d79 w2l::KenLM::KenLM()
@ 0x55ed14f65140 main
@ 0x7f0726bfeb97 __libc_start_main
@ 0x55ed14fc40ea _start
Aborted (core dumped)
Question#2: How do I use zeroLM?
Question#3: Will zeroLM give the same result as the greedy path (where all flags are the same except for the zeroLM flag)?
Approach#3: Beam-search decoding (with KenLM, lmweight). The corresponding configurations/parameters used:
--am=/home/training1/english_train/011_model_lists#dev.lst.bin
--tokensdir=/home/english_data/am1
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/home/english_data/am1/librispeech-train+dev-unigram-10000-nbest10.lexicon
--lm=/home/pre_trained_model/lm_librispeech_kenlm_4g.bin
--datadir=/home/english_data/
--test=lists1/dev.lst
--sclite=/home/sclite1
--uselexicon=true
--decodertype=wrd
--beamsize=50
--beamsizetoken=10
--beamthreshold=10
--attentionthreshold=30
--smoothingtemperature=1
--nthread_decoder=1
--show
#--showletters
--lmtype=kenlm
--lmweight=0.94468978683208
--eosscore=-8.4672579451673
--maxload=2
The output is:
|T|: she always turns to him when she is in real trouble
|P|: the people
[sample: 279.wav, WER: 100%, LER: 78.4314%, slice WER: 100%, slice LER: 78.4314%, decoded samples (thread 0): 1]
|T|: none of this ha d made an impression on the boy
|P|: it is
[sample: sample-003828.wav, WER: 100%, LER: 91.4894%, slice WER: 100%, slice LER: 84.6939%, decoded samples (thread 0): 2]
Question#4: Why is the output so bad? Does it mean that the training is not okay?
@tlikhomanenko @vineelpratap Kindly help with my issue. I have been facing this problem for a long time.
Hi,
I think there is an issue with the config you are using for the Test binary.
Running Test should give the same result as the lists/dev.lst-WER that you see in the train logs (| lists/dev.lst-WER: 33.94 |). Are you sure that you are using the correct tokens, lexicon set, and model while running the Test binary?
Let me give more details.
Question#1: What is the difference between the Decoder and Test commands while keeping every parameter the same in the config file? (using a greedy path for both)
The Test.cpp binary is used only for Viterbi WER computation: it computes only the argmax at each time step. No LM is used, so you cannot run it with zerolm or whatever LM you pass. The only decoder-side flags used in Test.cpp are uselexicon and lexicon; the rest, such as beam-related flags and the various scores, will be ignored. The uselexicon flag defines the behaviour with OOV words: if you are using the lexicon, then OOV words produced on the Viterbi path will be mapped to unk and your WER/LER will be a bit higher. So to sum up, you need to run Test.cpp like:
wav2letter/build/Test \
--am=path/to/train/am.bin \
--maxload=10 \
--test=path/to/test/list/file \
--tokensdir=path/to/tokens/dir \
--tokens=tokens.txt \
--lexicon=path/to/the/lexicon/file \
--uselexicon=false
Question#2: How do I use zeroLM?
Your config looks fine for zerolm. The error is probably because you have a space after lm=; could you make sure there is no space? Also remember that in the case of --uselexicon=true you also need to tune/set the wordscore, which can help a bit.
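As an illustration, the zeroLM config could look like this sketch (the paths are placeholders, and the wordscore/eosscore values are untuned examples, not recommendations):
--am=/path/to/am.bin
--test=lists1/dev.lst
--tokensdir=/path/to/tokens/dir
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/path/to/lexicon
--lm=
--uselexicon=true
--decodertype=wrd
--wordscore=1.0
--eosscore=-5
--show
Note that --lm= ends right after the equals sign, with nothing (not even a space) after it.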
Question#3: Will zeroLM give the same result as the greedy path (where all flags are the same except for the zeroLM flag)?
Nope, see the comment above on Test.cpp. With tuned wordscore and eosscore it could give a slightly better result than Viterbi. Also, Test.cpp only takes the argmax at each step, while zerolm really keeps the beam.
Question#4: Why is the output so bad? Does it mean that the training is not okay?
You cannot just switch to another decoding mode (zeroLM -> kenLM) with the same parameters; you can think of them as totally different functions. So you need to tune the parameters: in the case of zeroLM, wordscore and eosscore; in the case of kenlm, wordscore, eosscore, and lmweight.
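As a rough sketch of how such a sweep could be run from the shell (the flags file name, the value grids, and the assumption that flags given on the command line override the ones in the flags file are all placeholders/assumptions to adapt):
# Hypothetical sweep over lmweight / wordscore / eosscore for a kenlm decoding run.
for lmweight in 0.5 1.0 1.5 2.0 2.5; do
  for wordscore in -1 0 1 2; do
    for eosscore in -8 -5 -2 0; do
      /root/wav2letter/build/Decoder --flagsfile decoder.cfg \
        --lmweight=$lmweight --wordscore=$wordscore --eosscore=$eosscore \
        | tee decode_lw${lmweight}_ws${wordscore}_eos${eosscore}.log
    done
  done
done
Then pick the combination with the lowest WER on your dev list.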
Hope this is helpful.