
Unexpected decoding behaviour with Beam Search Decoder using a German AM and LM

Open realbaker1967 opened this issue 5 years ago • 4 comments

Hi there,

I am trying to do Beam Search Decoding with a pre-trained German acoustic model (trained with wav2letter) and a pre-trained German N-gram language model (order 6). Models are from Zamia-Speech.

However, the decoded text sometimes misses the first word and often misses the last word; the second issue occurs more frequently. For example:

Annotation: Sie pflegten die Kranken und verbanden die Verwundeten.
Hypothesis: pflegt eine kranken und verwandten die verwundeten en

This is what my decode.cfg looks like:

  --am=root/wav2letter/model/acoustic_model.bin
  --tokensdir=root/wav2letter/model/
  --tokens=tokens.txt
  --lexicon=root/wav2letter/model/lexicon.txt
  --lm=root/wav2letter/model/language_model.bin
  --datadir=root/host
  --test=list.lst
  --lmweight=4
  --wordscore=2.2
  --beamsize=2500
  --beamscore=40
  --silweight=-1
  --nthread_decoder=1
  --smearing=max
  --show
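For completeness, I pass this flags file to the decoding binary roughly as below; the binary name and build path are illustrative and depend on how wav2letter was built:

  wav2letter/build/Decoder --flagsfile=decode.cfg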

What might be the reason for the above problem? By the way, the original language model was in arpa format; I converted it using KenLM's build_binary.
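Concretely, the conversion was roughly this (file names are illustrative, not my exact paths):

  build_binary language_model.arpa language_model.bin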

Thank you in advance

realbaker1967 avatar May 29 '20 15:05 realbaker1967

Hi @realbaker1967,

Converting the arpa file into bin is not a problem and should be fine. You can also try specifying the arpa file directly, since w2l works with arpa too (just to be sure that the arpa gives the same results as the bin).
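A minimal sketch of that change in decode.cfg (assuming the arpa file sits next to the bin one; adjust the path to wherever yours lives):

  --lm=root/wav2letter/model/language_model.arpa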

I need some info from you to understand the potential problems here:

  • With which criterion did you train the acoustic model?
  • Your language model is word-based, right?
  • What is your token set: letters, word pieces?
  • Can you show the head of your lexicon?
  • What is the arch for your am (some info here would be helpful)?
  • Do you notice the same problem if you compute the Viterbi path? This is good to check, since the problem might be with the acoustic model itself.

tlikhomanenko avatar May 30 '20 04:05 tlikhomanenko

I think Zamia-Speech is an OG 1.6 GB conv_glu ASG model

lunixbochs avatar May 30 '20 18:05 lunixbochs

Dear @tlikhomanenko,

Sorry for the late response. I've tried your suggestions; here are the observations:

Converting the arpa file into bin is not a problem and should be fine. You can also try specifying the arpa file directly, since w2l works with arpa too (just to be sure that the arpa gives the same results as the bin).

I got the same hypotheses with the arpa-formatted language model.

* With which criterion did you train the acoustic model?

ASG

* Your language model is word-based, right?

Since it is a 6-gram language model, I assume it is word-based. Is my assumption correct?

* What is your token set: letters, word pieces?

It is letters.

* Can you show the head of your lexicon?

a ? 'e I
a_1 ? '{
a_2 ? 'a:
a_3 ? 'A:
aa ? 'a:
aaaah ? 'a: a:
aaah ? 'a a:
aab ? 'a: p
aabs ? 'a p s
aach ? 'a x

* What is the arch for your am (some info here would be helpful)?

It is a conv_glu model:

V -1 1 NFEAT 0
WN 3 C NFEAT 400 13 1 170
GLU 2
DO 0.2
WN 3 C 200 440 14 1 0
GLU 2
DO 0.214
WN 3 C 220 484 15 1 0
GLU 2
DO 0.22898
WN 3 C 242 532 16 1 0
GLU 2
DO 0.2450086
WN 3 C 266 584 17 1 0
GLU 2
DO 0.262159202
WN 3 C 292 642 18 1 0
GLU 2
DO 0.28051034614
WN 3 C 321 706 19 1 0
GLU 2
DO 0.30014607037
WN 3 C 353 776 20 1 0
GLU 2
DO 0.321156295296
WN 3 C 388 852 21 1 0
GLU 2
DO 0.343637235966
WN 3 C 426 936 22 1 0
GLU 2
DO 0.367691842484
WN 3 C 468 1028 23 1 0
GLU 2
DO 0.393430271458
WN 3 C 514 1130 24 1 0
GLU 2
DO 0.42097039046
WN 3 C 565 1242 25 1 0
GLU 2
DO 0.450438317792
WN 3 C 621 1366 26 1 0
GLU 2
DO 0.481969000038
WN 3 C 683 1502 27 1 0
GLU 2
DO 0.51570683004
WN 3 C 751 1652 28 1 0
GLU 2
DO 0.551806308143
WN 3 C 826 1816 29 1 0
GLU 2
DO 0.590432749713
RO 2 0 3 1
WN 0 L 908 1816
GLU 0
DO 0.590432749713
WN 0 L 908 NLABEL
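For anyone not used to the wav2letter arch syntax, my rough reading of the first few lines (an annotation I added, please correct me if the semantics differ):

  V -1 1 NFEAT 0              # view/reshape of the input features
  WN 3 C NFEAT 400 13 1 170   # weight-normalized 1-D conv: NFEAT -> 400 channels, kernel 13, stride 1, padding 170
  GLU 2                       # gated linear unit, halving 400 -> 200 channels (the next conv takes 200 inputs)
  DO 0.2                      # dropout with p = 0.2

The remaining blocks repeat this conv + GLU + dropout pattern with growing widths, followed by a reorder (RO) and two weight-normalized linear (L) layers at the end.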

* Do you notice the same problem if you compute the Viterbi path? This is good to check, since the problem might be with the acoustic model itself.

I assume you meant to try the greedy path here. I've tried it with the following configuration:

  --am=root/wav2letter/model/acoustic_model.bin
  --tokensdir=root/wav2letter/model/
  --tokens= tokens.txt
  --lexicon= root/wav2letter/model/lexicon.txt
  --datadir= root/host
  --test= list.lst
  --nthread_decoder= 1
  --show

However, I got almost garbage results like:

hyp: siehe pfleg cut te te te te te te te te inne kranken und te te te te verwanden die pp pe pp pp pp pe pe pp pe pe pe pe pe pp ruh ndaho
ref: Sie pflegten die Kranken und verbanden die Verwundeten.

Did I make a mistake when computing the Viterbi path?

Best

realbaker1967 avatar Jun 26 '20 12:06 realbaker1967

@realbaker1967

Sorry for the late response. I've tried your suggestions; here are the observations:

Sure, no problem.

  • To check whether your n-gram model is word-based (just to make sure; FYI, an n-gram model is not necessarily word-based), have a look at the head of the arpa file and verify that it contains words. My guess is that yours is word-based. (See the quick check sketched after this list.)
  • Just curious (I don't know German): is aach ? 'a x a letter transcription of the word aach? It looks like phonemes, no? (See the lexicon example after this list.)
  • For Viterbi, run the Test binary with these settings, not the Decode binary:
  --am=root/wav2letter/model/acoustic_model.bin
  --tokensdir=root/wav2letter/model/
  --tokens=tokens.txt
  --lexicon=root/wav2letter/model/lexicon.txt
  --datadir=root/host
  --test=list.lst
  --nthread_decoder=1
  --show
  --showletter
  • For Viterbi do you have the same WER as in the logs during training (run on the same data)?
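A quick way to do the first check is just to look at the arpa header and the first n-gram entries, e.g. (path is illustrative, use wherever your arpa file lives):

  head -n 30 root/wav2letter/model/language_model.arpa

If the \1-grams: section lists whole words (der, die, und, ...), the model is word-based; if it lists single characters, it is character-based.

On the lexicon: since you said your tokens are letters, I would expect entries that spell each word out in letters, something like (an illustration, not taken from your files):

  aach a a c h

If the lexicon instead lists phonemes, the lexicon and the token set would not match.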

tlikhomanenko avatar Jun 28 '20 07:06 tlikhomanenko