wav2letter How to decode by using a pre-trained acoustic model and customized 3-gram LM

I would like to apply the pre-trained acoustic model, which was trained on the LibriSpeech dataset, to my own custom data, changing only the language model and the lexicon file.

I used the SOTA acoustic model (LibriSpeech | TDS CTC clean, dev-clean) from "https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019". I also followed the advice from @tlikhomanenko to create my new custom lexicon based on the SentencePiece model, as described in "https://github.com/facebookresearch/wav2letter/issues/737#issuecomment-654621541".

Here is what the lexicon looks like:

!\exclamation-mark _exclamation po i n t
!\exclamation-mark _exclamation p o i nt
!\exclamation-mark _exclamation p o i n t
!\exclamation-mark _ex c lam ation po int
!\exclamation-mark _ex cla m ation po int
"\close-double-quotes _closed ou ble q u ot e
"\close-double-quotes _closed ou ble q u o te
"\close-double-quotes _close d ou ble q u ot e
"\close-double-quotes _closed o u ble q u ot e
"\close-double-quotes _closed ou ble q u o t e
"\close-double-quotes _closed o ub le q u ot e
PLC _ p l s e e
PLEDs _pe ee led e es
PLEDs _pe e el ed e es
PLEDs _pe ee led ee s
PLEDs _pe e e led e es
complicates _comp li c at es
complicates _comp l ic a tes
complicating _comp li cat ing
complicating _comp l ic ating
complicating _comp li c ating
complicating _comp li ca ting
›\close-angle-bracket _right ang l e bra ck ets
€\euro-sign _eu ro s ign

Here is what my custom 3-gram.arpa language model looks like:

\data\
ngram 1=51664
ngram 2=8394983
ngram 3=17809566

\1-grams:
-99.00000 <s> 0.000000
-99.00000 </s> 0.000000
-3.952232 !\exclamation-mark 0.152154
-4.227973 "\close-double-quotes -0.699384
-5.570467 "\double-quotes -0.320485
-4.293772 "\open-double-quotes 0.208263
-7.298170 .asia 0.000000
-6.156094 .at -0.180662

Decoding flags file:

--lexicon=/var/data/customized_data-wordpiece-dts_ctc_web-10000-nbest10.lexicon
--tokensdir=/var/data/training/pretrained_models/w2l/model_from_website/tds_ctc/
--tokens=librispeech-train-all-unigram-10000.tokens
--am=/var/pretrained-model_from_w2l_website/tds_ctc/am_tds_ctc_librispeech_dev_clean.bin
--lm=/var/data//model_lm/customized_lm_3gram.arpa
--test=/var/data/lists/customized_data.lst
--listdata=true
--lmweight=0.67470637680685
--wordscore=0.62867952607587
--uselexicon=True
--decodertype=wrd
--lmtype=kenlm
--silscore=0
--beamsize=250
--beamsizetoken=100
--beamthreshold=100
--maxtsz=100000000000000
--mintsz=0
--maxisz=100000000000000
--minisz=0
--nthread_decoder=4
--smearing=max
--show=true
--input=wav

After decoding with this custom language model and the pre-trained acoustic model, the recognition is not good at all. However, when I used the same approach in the Kaldi Speech Recognition Toolkit to evaluate the same custom dataset, I got a very good recognition rate.

Notes: I used this pre-trained model to evaluate LibriSpeech test-clean with the 3-gram LM and got approximately 4% WER. I am wondering whether the new lexicon that I created is correct or not (it contains special and capital characters, whereas the original lexicon contains only lower-case characters). I would really appreciate your advice.

The original lexicon looks like:

a _a
a _ a
a'azam _a ' a za m
a'azam _a ' a z am
a'azam _a ' a z a m
a'azam _ a ' a za m
a'azam _ a ' a z am
a'azam _ a ' a z a m
a'll _a 'll
a'll _a ' ll
a'll _ a 'll
a'll _a ' l l
a'll _ a ' ll
a'll _ a ' l l
a'most _a ' most
a'most _a ' mo st
a'most _a ' mo s t
a'most _a ' m o st
a'most _ a ' most
a'most _a ' m os t
a'most _a ' m o s t
a'most _ a ' mo st
a'most _ a ' mo s t

The original 3-gram.arpa language model looks like:

\data\
ngram 1=200003
ngram 2=38229161
ngram 3=49712290

\1-grams:
-2.348754 </s>
-2.752519 <UNK> -0.9697837
-99 <s> -2.408548
-2.619969 A -1.56262
-7.211563 A''S -0.1495221
-6.221141 A'BODY -0.2521624
-6.583487 A'COURT -0.1844242
-6.240468 A'D -0.2419162
-7.108924 A'GHA -0.1740987
-6.260695 A'GOIN -0.407374
-5.804425 A'LL -0.3104859
-5.638036 A'M -0.2983787
-6.221141 A'MIGHTY -0.2186911
-6.885612 A'MIGHTY'S -0.1800996
-4.996963 A'MOST -0.4291549
-5.247992 A'N'T -0.4541333
-6.68067 A'PENNY -0.1663875
-5.518031 A'READY -0.3329575
-6.202638 A'RIGHT -0.3302211

Nov 23 '20 11:11 kerolos

The lexicon maps words to sequences of AM tokens and restricts the search in the beam-search decoder, so your lexicon file should map each word to a sequence of tokens taken only from librispeech-train-all-unigram-10000.tokens. Your LM should also use the same words as your lexicon, so you need to use either upper case or lower case in both.
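The two consistency requirements above can be checked mechanically before decoding. The following is a minimal sketch, not part of wav2letter: the function names, the toy token set, and the toy LM vocabulary are all hypothetical, and it assumes the lexicon format shown in this thread (a word followed by its space-separated tokens).

```python
def load_tokens(lines):
    """Return the set of AM tokens, assuming one token per line."""
    return {line.strip() for line in lines if line.strip()}


def check_lexicon(lexicon_lines, tokens, lm_words):
    """Collect lexicon problems: unknown tokens and words missing from the LM.

    Each lexicon line is assumed to be "<word> <token> <token> ...".
    """
    unknown_tokens = []    # (word, token) pairs not present in the token set
    missing_in_lm = set()  # lexicon words that the LM vocabulary lacks
    for line in lexicon_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, spelling = parts[0], parts[1:]
        for tok in spelling:
            if tok not in tokens:
                unknown_tokens.append((word, tok))
        if word not in lm_words:
            missing_in_lm.add(word)
    return unknown_tokens, missing_in_lm


# Toy data for illustration only (not taken from the real token file).
tokens = load_tokens(["_a", "a", "'", "za", "m", "_un", "remarkable"])
lexicon = ["a _a", "a'azam _a ' a za m", "Unremarkable _un remarkable", "plc _ p l c"]
lm_words = {"a", "a'azam", "unremarkable"}

bad_tokens, bad_words = check_lexicon(lexicon, tokens, lm_words)
print(bad_tokens)         # tokens the acoustic model cannot emit
print(sorted(bad_words))  # words spelled or cased differently from the LM
```

In this toy run, "Unremarkable" is reported because its casing differs from the LM entry, which is the kind of mismatch described above.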

Nov 25 '20 03:11 tlikhomanenko

Thanks for your reply, I really appreciate it.

I made a mistake when I decoded the LibriSpeech test-clean dataset. The 3-gram.arpa is now converted to lower case first and then to binary format, so that it matches the words in the lexicon (lower case).

Result with the wrong LM (3-gram.arpa): [Decode /var/data/librispeech/lists/test-clean.lst (2616 samples) in 712.07s (actual decoding time 2.16s/sample) -- WER: 6.97887, LER: 1.4074]
Result with the right LM (3-gram.bin): [Decode /var/data/librispeech/lists/test-clean.lst (2616 samples) in 679.095s (actual decoding time 2.07s/sample) -- WER: 3.2862, LER: 1.50971]
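The lower-casing step could look roughly like the sketch below. It is an illustration, not the script that was actually used: `lowercase_arpa` is a hypothetical helper, and it assumes a standard ARPA layout in which the probability, the n-gram, and the optional backoff weight are whitespace-separated under `\N-grams:` headers. Note that lower-casing can merge entries that differ only in case, which would leave duplicate n-grams in the output.

```python
import re

# Matches section headers such as "\1-grams:" and captures the n-gram order.
NGRAM_HEADER = re.compile(r"^\\(\d+)-grams:$")


def lowercase_arpa(lines):
    """Yield ARPA lines with the words of every n-gram lower-cased."""
    order = 0  # 0 means we are still in the "\data\" header section
    for raw in lines:
        line = raw.rstrip("\n")
        header = NGRAM_HEADER.match(line.strip())
        if header:
            order = int(header.group(1))
            yield line
            continue
        fields = line.split()
        # Pass through blank lines, the header section, and "\end\".
        if order == 0 or len(fields) < order + 1 or line.startswith("\\"):
            yield line
            continue
        # fields[0] is the log10 probability, the next `order` fields are the
        # words, and anything left over is the backoff weight.
        words = [w.lower() for w in fields[1:1 + order]]
        yield "\t".join([fields[0], " ".join(words)] + fields[1 + order:])


# Tiny hand-made sample in the same shape as the LM excerpt in this thread.
sample = [
    "\\data\\", "ngram 1=3", "ngram 2=1", "",
    "\\1-grams:",
    "-2.752519\t<UNK>\t-0.9697837",
    "-99\t<s>\t-2.408548",
    "-2.619969\tA\t-1.56262",
    "",
    "\\2-grams:",
    "-0.5\tA'LL A",
    "",
    "\\end\\",
]
converted = list(lowercase_arpa(sample))
print("\n".join(converted))
```

The special token `<UNK>` is lower-cased along with the ordinary words, so it is worth checking that the result is what the LM toolkit expects.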

Here is an example of what I did in the lexicon to get better recognition of one word. I used the SentencePiece tools and the token WP model to generate the lexicon. The word "unremarkable" is not recognized at all in my test set. Here is the part of the custom lexicon, generated with the WP tools, for the word "unremarkable":

unremarkable _unre mark able
unremarkable _un re mark able
unremarkable _unre ma rk able
unremarkable _unre mar k able
unremarkable _unre mark a ble
unremarkable _unre mark ab le
unremarkable _unre mar ka ble
unremarkable _un r e mark able
unremarkable _unre ma r k able
unremarkable _unre m ar k able

By adding the line

unremarkable _un _remarkable

to the lexicon (the tokens _un and _remarkable already exist in the token list "librispeech-train-all-unigram-10000.tokens"), recognition of the word "unremarkable" improved from 0% to 12 out of the 25 times it appears in my test set. When it is missed, it is always recognized as "remarkable" instead of "unremarkable".

Question 1: I would like your advice, because I am a little confused: I used two underscore tokens to map one word. Is that okay? Note: I deactivated the --wordseparator flag in the decode file.

Question 2: Some tokens exist in the token list in two different forms (the token with an underscore vs. the same token without one), for example _line vs. line. In my opinion these are similar and may confuse the overall system during recognition.

Question 3: Is it better to use a phonetic-based lexicon instead of a letter-based or word-piece lexicon in my case? In Kaldi I used only a letter-based lexicon and a phonetic-based lexicon and got better results. Do you think the problem in W2L in my case is caused by the acoustic model having been trained with a WP lexicon, or by using this deterministic WP lexicon?

Nov 26 '20 10:11 kerolos

1. You should not deactivate wordseparator. Each underscore marks a boundary between words, which is why "_remarkable" will be treated as a separate word "remarkable" anyway. You need to set its transcription to "_un remarkable", or just apply the word-piece model to the word to get its WP sequence.
2. _line means that you start a new word, while "line" just continues a word, so it can only appear in the middle or at the end of a word.
3. Hard to say; you need to experiment. The problem could also be that your data has a different lexicon than LibriSpeech, so the WPs are not appropriate for your data. I would try to train a letter-based AM.
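The boundary rule in the first two answers can be illustrated with a few lines of Python. This is only a sketch of the convention, not the decoder's actual code: `pieces_to_words` is a hypothetical helper, and it assumes the word-start marker is written as a plain underscore, as it appears in this thread.

```python
def pieces_to_words(pieces, boundary="_"):
    """Join word pieces into words; a leading boundary marker starts a new word."""
    words = []
    for piece in pieces:
        if piece.startswith(boundary):
            # Marker present: open a new word (marker itself is dropped).
            words.append(piece[len(boundary):])
        elif words:
            # No marker: the piece continues the current word.
            words[-1] += piece
        else:
            # A continuation piece with nothing before it starts the first word.
            words.append(piece)
    return words


print(pieces_to_words(["_un", "_remarkable"]))      # ['un', 'remarkable']
print(pieces_to_words(["_un", "remarkable"]))       # ['unremarkable']
print(pieces_to_words(["_unre", "mark", "able"]))   # ['unremarkable']
```

Under this convention a spelling with two marked pieces reads as two words, whereas a single marked piece followed by unmarked ones reads as one word. Whether an unmarked piece such as "remarkable" is actually available depends on the token file.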

Nov 28 '20 01:11 tlikhomanenko

Thanks for your reply.

1- The remarkable token only appears with an underscore in the token list, like this: (_unremarkable). When I used the WP model to get the sub-word sequence (with nbest = 10, as shown in my previous comment), the word was not recognized anywhere in my test set. I am wondering why it performed better when I split the word this way (_un _remarkable) in the lexicon.

2- Do you think the similarity between two tokens (one with an underscore and one without) in the token list might confuse the DNN when predicting the right label (class), or will this be handled later during beam-search decoding?

3- I am training on the LibriSpeech 460-hour dataset using a phonetic-based lexicon (streaming convnet and TDS-CTC), but the training loss reaches 16.5 and does not decrease any further (after 120 epochs). I am using a different token set than the original one (https://www.openslr.org/resources/11/librispeech-lexicon.txt), because it is simpler.

The phonetic tokens look like this: ay uw aw oy jh dh ch p hh zh uh |

As a next step I will try to train those two models using a letter-based lexicon, as you suggested in the previous comment, and I will let you know whether this helps to create a generic acoustic model. Thanks in advance.

Dec 01 '20 11:12 kerolos

Hello @tlikhomanenko,

3- I tried to train (SOTA / TDS CTC) on the LibriSpeech dataset with a letter-based lexicon, but after 120 epochs the loss no longer decreases.

flagsfile:

--runname=/var/data/training/w2l/sota_2019/tds_ctc/am/tds_streaming_librispeech_graphene
--tag=001LR_00M_6BS_0R_M
--train=/var/data/en/LibriSpeech/lists/train-clean-100.lst,/mnt/data/en/LibriSpeech/lists/train-clean-360.lst
--valid=dev-clean:/var/data/en/LibriSpeech/lists/dev-clean.lst,dev-other:/var/data/en/LibriSpeech/lists/dev-other.lst
--lexicon=/mnt/data/training/w2l/sota_2019/tds_ctc/token_lexicon/graphene/lexicon.txt
--arch=/var/data/training/w2l/sota_2019/tds_ctc/am/am_tds_ctc.arch
--tokens=/var/data/training/w2l/sota_2019/tds_ctc/token_lexicon/graphene/tokens.txt
--surround=|
--criterion=ctc
--batchsize=6
--lrcosine=false
--lr=0.05
#--lrcrit=0.05
--momentum=0.5
--nthread=8
--mfsc=true
--enable_distributed=true
--logtostderr
--onorm=target
--sqnorm
--gamma=1.0
--showletters=true
--input=flac
#--warmup=42000

log:

Changing the learning rate has little effect on convergence. As shown in the log file, decreasing or increasing the LR makes the training neither converge nor diverge (the behavior stays the same). What should I change in the flagsfile to make the model converge quickly? I am using 2 RTX 2080 Ti GPUs and the training data is 460 hours of LibriSpeech. I would really appreciate your advice.

Dec 08 '20 13:12 kerolos

@kerolos: As @tlikhomanenko pointed out in the previous comments, the underscore character "_" denotes that the token starts a new word.

Think about the prefix "ed". Lots of words start with the prefix "ed", e.g. Edgbaston, edit, education, etc. In all these instances, the prefix "ed" starts a new word, so the token "_ed" would make it into the top 10k tokens if it occurs frequently enough.

The suffix "ed" also appears quite frequently to denote the past tense of a verb, e.g. summed, listened, banned, etc. In all these instances, it doesn't start a new word, so the token "ed" (without an underscore) would also make it into the top 10k tokens.

This is why you may have two instances of token "ed", one with an underscore and one without.
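To see which tokens come in such pairs, one could scan the token file for entries that exist both with and without the leading underscore. The snippet below is a small sketch of that idea; `paired_tokens` and the toy token list are made up for illustration and assume a plain underscore as the word-start marker.

```python
def paired_tokens(tokens, boundary="_"):
    """Return tokens that exist both as a word start and as a continuation."""
    token_set = set(tokens)
    return sorted(
        t for t in token_set
        if not t.startswith(boundary) and boundary + t in token_set
    )


# Toy token list; a real run would read the lines of the .tokens file instead.
tokens = ["_ed", "ed", "_line", "line", "_remarkable", "ing"]
print(paired_tokens(tokens))  # ['ed', 'line']
```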

With regard to your 2nd question, I presume (and this could be verified by looking at the emissions) that both versions of a token (with and without the underscore, if both exist in the token set) would receive a higher probability for an utterance. The word-level LM would then help disambiguate which token to choose: whether to start a new word by taking the token that starts with the underscore, or to continue the previous word.

Dec 08 '20 15:12 abhinavkulkarni

Thanks a lot @abhinavkulkarni for your reply. As you suggest, I will have a look at the emission files to see what the output looks like for the specific word (unremarkable).

I am wondering why it is difficult to train such an acoustic model with a different lexicon, such as a phonetic-based AM or a letter-based AM.

--train AM using letter-based lexicon:

epochs 1 to 63 --> lr: 0.001000
epochs 64 to 67 --> lr: 0.01000
epochs 68 to 90 --> lr: 0.1000
epochs 91 to 118 --> lr: 0.40
epochs 119 to 130 --> lr: .05
(epoch = iteration)

--train AM using phonetic-based lexicon:

[Plots not preserved: WER, TER, and loss curves for the phonetic-based TDS-CTC model]

Dec 09 '20 14:12 kerolos

What stride of the model did you use for phoneme and letter-based cases?

Dec 10 '20 08:12 tlikhomanenko

The model is TDS-CTC (streaming version); the stride in the C2 layers is 2 (the default):

SAUG 80 27 2 100 1.0 2
V -1 NFEAT 1 0
C2 1 10 21 1 2 1 -1 -1
R
DO 0.0
LN 1 2
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.1 2400
TDS 10 21 80 0.1 2400
C2 10 14 21 1 2 1 -1 -1
R
DO 0.0
LN 1 2
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
C2 14 18 21 1 2 1 -1 -1
R
DO 0.0
LN 1 2
TDS 18 21 80 0.15 4320
TDS 18 21 80 0.15 4320
TDS 18 21 80 0.15 4320
TDS 18 21 80 0.15 4320
TDS 18 21 80 0.2 4320
TDS 18 21 80 0.2 4320
TDS 18 21 80 0.25 4320
TDS 18 21 80 0.25 4320
TDS 18 21 80 0.25 4320
TDS 18 21 80 0.25 4320
V 0 1440 1 0
RO 1 0 3 2
L 1440 NLABEL

Should I change the stride to 1, as in the tutorial model (letter-based)?

V -1 1 NFEAT 0
C2 NFEAT 256 8 1 2 1 -1 -1 --> 2 stride
R
C2 256 256 8 1 1 1 -1 -1 --> 1 stride
R
C2 256 256 8 1 1 1 -1 -1 --> 1 stride
R
C2 256 256 8 1 1 1 -1 -1 --> 1 stride
R
C2 256 256 8 1 1 1 -1 -1 --> 1 stride
R
C2 256 256 8 1 1 1 -1 -1 --> 1 stride
R
C2 256 256 8 1 1 1 -1 -1 --> 1 stride
R
C2 256 256 8 1 1 1 -1 -1 --> 1 stride
R
RO 2 0 3 1
L 256 512
R
L 512 NLABEL

Dec 10 '20 08:12 kerolos

The total stride of the model is 8, which is too much for letters and phonemes.

...
 C2 1 10 21 1 2 1 -1 -1
...
 C2 10 14 21 1 2 1 -1 -1
...
 C2 14 18 21 1 2 1 -1 -1
...

You need to use a total stride of 2 or 3, maybe 4. So you can just change the stride of the lowest layers from 2 to 1. FYI: a stride of 8 means that you output only one letter every 80 ms. With this stride, or with 16, we can output entire words. For letters/phonemes you need to emit outputs more frequently.
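The arithmetic behind these numbers is just a product of the per-layer strides. The sketch below assumes a 10 ms feature frame shift, which is consistent with the 80 ms figure quoted above for a total stride of 8; the helper names are hypothetical.

```python
from math import prod


def total_stride(conv_strides):
    """Overall time sub-sampling factor of a stack of strided convolutions."""
    return prod(conv_strides)


def ms_per_output(conv_strides, frame_shift_ms=10):
    """Milliseconds of audio covered by one output frame (10 ms shift assumed)."""
    return total_stride(conv_strides) * frame_shift_ms


print(ms_per_output([2, 2, 2]))  # 80: the three C2 layers as posted
print(ms_per_output([1, 2, 2]))  # 40: lowest C2 layer changed to stride 1
print(ms_per_output([1, 1, 2]))  # 20: two lowest C2 layers changed to stride 1
```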

Dec 11 '20 07:12 tlikhomanenko
