icefall
About fast_beam_search
When I used the unigram subword method on Japanese, the WER obtained by fast_beam_search was as expected, but when I used bpe subwords, fast_beam_search produced a lot of deletion errors, as follows:
unigram: Overall -> 7.99 % N=37537 C=35053 S=1640 D=844 I=514
bpe: Overall -> 12.29 % N=37537 C=33232 S=1422 D=2883 I=309
unigram: (分->わ) か ら な い と こ ろ は 任 せ (切->き) り に す る と こ ろ が あ る と こ ろ が あ る の で そ れ を 直 し て い た だ き た い
bpe: (分 か ら な い と こ ろ は 任 せ 切->*) り (に す る と こ ろ が あ る と こ ろ が あ る の で そ れ を 直 し て い->*) た (だ き た->*) い
What are the possible reasons for this difference?
@LoganLiu66 Did you retrain the model when changing the bpe model? Please check whether the tokens.txt files for the two bpe models are the same.
I trained both models from scratch, and the two bpe models are located in different directories, selected by --lang-dir.
After reducing --beam from 20.0 to 2.0 and --max-contexts from 8 to 2, it seems to be better (WER from 12.29 to 7.81), but the WER is still higher than with greedy_search.
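For context, this is roughly where those flags end up, assuming the fast_beam_search_one_best helper used by the pruned_transducer_stateless recipes (the helper name and argument order are assumptions and may differ between recipes and versions):

```python
# Sketch only: model, encoder_out, encoder_out_lens, params and device come
# from the recipe's decode.py and are not defined here.
import k2
from beam_search import fast_beam_search_one_best  # assumed import path

decoding_graph = k2.trivial_graph(params.vocab_size - 1, device=device)

hyp_tokens = fast_beam_search_one_best(
    model=model,
    decoding_graph=decoding_graph,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
    beam=2.0,        # --beam, reduced from 20.0: a tighter beam prunes long all-deletion paths
    max_contexts=2,  # --max-contexts, reduced from 8
    max_states=64,   # --max-states, left at the recipe default
)
```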
I have worried in the past that there might be certain conditions of training data where fast beam search (or beam search in general) could lead to a lot of deletions.
If the training data contained long stretches of audio that had no text in the transcripts, the RNN-T would learn to output nothing if the context did not correspond to the "correct" context. This could lead to deletions of long strings of words having a relatively high probability, which could end up being the best path if the words that were spoken were unclear.
I don't understand, though, what you mean about unigram subword versus bpe subword. Can you be more specific about what you did, exactly?
I first trained a unigram subword model using the spm tool with model_type="unigram", and used this model to tokenize the input. But I found that many words were recognized as <unk>, such as:
織->⁇ 酵->⁇ 糖->⁇ 胞->⁇ 億->⁇ 属->⁇ 浅->⁇
I think this may be related to the unigram subword method, because these characters do not occur in tokens.txt. So I trained a bpe subword model using the spm tool with model_type="bpe". The results show that bpe subwords reduce the <unk> predictions, and the WER with greedy_search is as expected. But when I use fast_beam_search with the default settings, it gives worse results than unigram.
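To check directly which characters each model covers, here is a small sketch, assuming sentencepiece is installed; the model paths and the sample sentence are placeholders:

```python
import sentencepiece as spm

# Placeholder paths: point these at the .model file in each lang dir.
for name, model_file in [("unigram", "lang_unigram_5000/unigram.model"),
                         ("bpe", "lang_bpe_5000/bpe.model")]:
    sp = spm.SentencePieceProcessor(model_file=model_file)
    unk_id = sp.unk_id()
    text = "分からないところは任せ切りにする"
    pieces = sp.encode(text, out_type=str)
    ids = sp.encode(text, out_type=int)
    # Any piece whose id equals unk_id is outside the model's vocabulary.
    oov = [p for p, i in zip(pieces, ids) if i == unk_id]
    print(name, "pieces:", pieces)
    print(name, "pieces mapped to <unk>:", oov)
```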
How many symbols are in your vocabulary? I recommend setting byte_fallback=True and coverage=0.98 (I think there is an option called "coverage"; I might be wrong about the name) so that it will represent rarer characters as bytes.
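A minimal training sketch with those options, using the standard sentencepiece Python API; the input path, model prefix and special-token ids below are placeholders to adapt to your lang dir:

```python
import sentencepiece as spm

# Placeholder paths and ids; adjust to your data layout and tokens.txt.
spm.SentencePieceTrainer.train(
    input="data/lang/transcript_chars.txt",
    model_prefix="lang_bpe_5000/bpe",
    model_type="bpe",
    vocab_size=5000,
    character_coverage=0.98,  # lets rare kanji fall outside the character set...
    byte_fallback=True,       # ...and byte_fallback encodes them as bytes instead of <unk>
    unk_id=2,                 # example only; match the ids expected by your tokens.txt
    bos_id=-1,
    eos_id=-1,
)
```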
The vocab size is set to 5000. I set character_coverage=0.98 in the unigram training because a RuntimeError occurs otherwise: [Vocabulary size is smaller than required_chars. 5000 vs 5092. Increase vocab_size or decrease character_coverage with --character_coverage option]. I don't use character_coverage=0.98 when training the bpe model because it trains normally without it. Could this be a possible reason for the difference?
I am not 100% sure of the behavior. I thought that even with the byte_fallback option, it required the character_coverage=0.98 option to run without crashing.
Do your training curves look normal on the tensorboard plots? It could be that your system just didn't converge well.
Yes, the training curves look normal.
OK. Well, I have been surprised in the past that we don't normally see a lot of deletions with fast_beam_search, since if the training data contains missing transcripts the model would learn to skip many characters with good probability. That does not explain why it specifically happens with this type of BPE unit, of course. I don't know if we will find out. Maybe you can try to spot any pattern, especially related to what the two-symbol left-context is just before deleted runs of characters. It could be that there is a particular word sequence in your training data that tends to have untranscribed words following it.
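For spotting such a pattern, a rough sketch that collects the two-symbol left contexts preceding long deleted runs; it assumes you already have per-utterance (reference, hypothesis) token lists, e.g. parsed from the errs-* files written during decoding (the parsing itself is not shown):

```python
import difflib
from collections import Counter

def deleted_runs(ref, hyp):
    """Yield (left_context, deleted_tokens) for every run of tokens
    present in the reference but missing from the hypothesis."""
    sm = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "delete":
            left_context = tuple(ref[max(0, i1 - 2):i1])  # two-symbol left context
            yield left_context, ref[i1:i2]

# Toy example; in practice iterate over all decoded utterances.
contexts = Counter()
for ref, hyp in [(list("分からないところは任せ切りにする"), list("分からないり"))]:
    for ctx, deleted in deleted_runs(ref, hyp):
        if len(deleted) >= 3:  # only count long deletions
            contexts[ctx] += 1

print(contexts.most_common(10))
```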
Introducing noise into the token history passed to the decoder during training might be another way to stop this from happening so much. I think the original RNN-T paper may have done this kind of thing, although it wasn't uniform noise.
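As a rough illustration of that idea (not something icefall does by default; the blank id, tensor shapes and probability are assumptions), one could corrupt a fraction of the tokens fed to the prediction network during training while keeping the clean targets for the loss:

```python
import torch

def corrupt_token_history(y: torch.Tensor,
                          vocab_size: int,
                          blank_id: int = 0,
                          prob: float = 0.1) -> torch.Tensor:
    """Randomly replace a fraction of the decoder-input tokens with random
    non-blank tokens, so the prediction network cannot rely too heavily on
    an exact token history.  `y` is (batch, U) of token ids."""
    noise_mask = (torch.rand_like(y, dtype=torch.float) < prob) & (y != blank_id)
    random_tokens = torch.randint(1, vocab_size, y.shape, device=y.device)
    return torch.where(noise_mask, random_tokens, y)

# Usage inside the training loop (sketch): corrupt only the decoder input,
# keep the original token sequence as the target for the transducer loss.
# decoder_input = corrupt_token_history(decoder_input, vocab_size=5000)
```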