Something confusing me in LexiconDecoder for CTC model
While debugging the CTC model decoder in recipes/streaming_convnets, I found something confusing: the LexiconDecoder appears to implement plain beam search, not the prefix beam search normally used for CTC. In CTC, if the network emits a sequence such as "hhee-l-lo" (using "-" for blank), it should collapse to "hello". But in plain beam search, "hh" is treated as two sequential tokens: we search for the first 'h' and reach state 1, then search for the second 'h' from state 1 and reach state 2. If the lexicon contains a word "hh", we could then emit the word "hh". Under CTC semantics this result is not legal, so it seems the LexiconDecoder could produce bad results for CTC.
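For reference, here is a minimal sketch of the collapse rule I mean (my own illustration, not wav2letter code): repeats are merged first and blanks are then removed, so a doubled letter survives only when a blank separates the two emissions.

```cpp
#include <iostream>
#include <string>

// Sketch of the CTC collapse rule: merge consecutive repeats, then
// drop blanks. Here '-' plays the role of the blank token.
std::string ctcCollapse(const std::string& frames, char blank = '-') {
  std::string out;
  char prev = blank;
  for (char tok : frames) {
    if (tok != blank && tok != prev) {
      out.push_back(tok); // a genuinely new emission
    }
    prev = tok; // repeats of `prev` and blanks are dropped
  }
  return out;
}

int main() {
  std::cout << ctcCollapse("hhee-l-lo") << "\n"; // prints "hello"
}
```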
Here is the relevant code (LexiconDecoder.cpp, lines 88-131), with my comments added:

```cpp
// we eat-up a new token
// (In my understanding, this check should prevent "hh" in the token search.)
if (opt_.criterionType != CriterionType::CTC || prevHyp.prevBlank ||
    n != prevIdx) {
  if (!lex->children.empty()) {
    if (!isLmToken_) {
      lmState = prevHyp.lmState;
      lmScore = lex->maxScore - lexMaxScore;
    }
    candidatesAdd(
        candidates_,
        candidatesBestScore_,
        opt_.beamThreshold,
        score + opt_.lmWeight * lmScore,
        lmState,
        lex.get(),
        &prevHyp,
        n,
        -1,
        false, // prevBlank
        prevHyp.amScore + amScore,
        prevHyp.lmScore + lmScore);
  }
}

// If we got a true word
// (But here, if the new node is a word, we will add it, which means we
// could still add "hh".)
for (auto label : lex->labels) {
  if (!isLmToken_) {
    auto lmStateScorePair = lm_->score(prevHyp.lmState, label);
    lmState = lmStateScorePair.first;
    lmScore = lmStateScorePair.second - lexMaxScore;
  }
  candidatesAdd(
      candidates_,
      candidatesBestScore_,
      opt_.beamThreshold,
      score + opt_.lmWeight * lmScore + opt_.wordScore,
      lmState,
      lexicon_->getRoot(),
      &prevHyp,
      n,
      label,
      false, // prevBlank
      prevHyp.amScore + amScore,
      prevHyp.lmScore + lmScore);
}
```
The lexicon decoder for CTC is a prefix beam-search decoder.
In the branch you pointed to, we try to continue with a token that is different from the last token of the prefix; in that case we switch to another LM state, for example.
The branch you mention, having "hhh" and staying in the same LM state (which means we just squeeze the repeats together to form one token), is here: https://github.com/facebookresearch/wav2letter/blob/be863bb941108e95545b94fdf192722699295c63/src/libraries/decoder/LexiconDecoder.cpp#L153 - check the hypothesis we add in this case: https://github.com/facebookresearch/wav2letter/blob/be863bb941108e95545b94fdf192722699295c63/src/libraries/decoder/LexiconDecoder.cpp#L164.
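To make this concrete, here is a simplified sketch of the case analysis in CTC prefix beam search (illustrative names, not the actual wav2letter code): a repeated token with no blank in between stays on the same trie node and LM state, while a blank or a different token opens a new emission.

```cpp
#include <iostream>

// Hypothesis state relevant to the CTC case analysis (hypothetical type).
struct Hyp {
  int prevToken;  // last non-blank token of the prefix
  bool prevBlank; // whether the previous frame emitted a blank
};

enum class Case { KeepPrefix, MergeRepeat, NewToken };

// Decide how a frame emitting token `n` extends hypothesis `prev`.
Case classify(const Hyp& prev, int n, int blank) {
  if (n == blank) {
    // Blank: prefix unchanged, but prevBlank is set so that a repeated
    // token after it ("h - h") counts as a fresh emission.
    return Case::KeepPrefix;
  }
  if (n == prev.prevToken && !prev.prevBlank) {
    // Same token, no blank in between ("hh"): squeeze into the previous
    // emission; the prefix, trie node, and LM state all stay the same.
    return Case::MergeRepeat;
  }
  // Genuinely new token: advance in the lexicon trie and possibly move
  // to a new LM state (the branch quoted in the question).
  return Case::NewToken;
}

int main() {
  Hyp h{/*prevToken=*/7, /*prevBlank=*/false};
  std::cout << (classify(h, 7, /*blank=*/0) == Case::MergeRepeat) << "\n"; // 1
  h.prevBlank = true;
  std::cout << (classify(h, 7, /*blank=*/0) == Case::NewToken) << "\n"; // 1
}
```

The merge case corresponds to the L153 branch linked above; the new-token case corresponds to the branch quoted in the question.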