Static kTrieMaxLabel=6 causes issues with phoneme-based recognition
Bug Description
I tried building ASR systems on a very common standard task (LibriSpeech-100h) using the torchaudio CTC decoder, which uses the flashlight/text library as its decoding backend. My subword (BPE) based setups worked fine, but the phoneme-based ones did not.
The standard LibriSpeech lexicon includes, for example, these 7 words, which all map to the same phone sequence in ARPA notation:
BAE B AY#
BAI B AY#
BI B AY#
BUY B AY#
BY B AY#
BY' B AY#
BYE B AY#
As a result, the word BY, for example, was no longer recognized.
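To check whether a given lexicon is affected, the entries can be grouped by phone sequence and every group with more than 6 words flagged. The following is only a minimal sketch: the function name `find_overfull_pronunciations` and the inline sample are made up for illustration and are not part of torchaudio or flashlight/text, and it assumes the lexicon format shown above (word first, then whitespace-separated phones).

```python
from collections import defaultdict

# Mirrors the compiled-in limit reported in the log message (assumed to be 6).
K_TRIE_MAX_LABEL = 6


def find_overfull_pronunciations(lexicon_lines, limit=K_TRIE_MAX_LABEL):
    """Return {phone sequence: [words]} for sequences shared by more than `limit` words."""
    words_by_pron = defaultdict(list)
    for line in lexicon_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, pron = parts[0], " ".join(parts[1:])
        if word not in words_by_pron[pron]:
            words_by_pron[pron].append(word)
    return {p: w for p, w in words_by_pron.items() if len(w) > limit}


# The seven homophones from the lexicon excerpt above.
sample = """\
BAE B AY#
BAI B AY#
BI B AY#
BUY B AY#
BY B AY#
BY' B AY#
BYE B AY#
""".splitlines()

overfull = find_overfull_pronunciations(sample)
for pron, words in overfull.items():
    print(f"{pron}: {len(words)} words -> {', '.join(words)}")
```

For a real lexicon, the lines of the lexicon file would be passed in instead of `sample`.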
In the log I get the message:
[Trie] Trie label number reached limit: 6
This message correctly reports that the limit was applied, but I would like to point out that the limit is very low and cannot be changed without recompiling. The message also did not look like a serious issue to me at first.
Reproduction Steps
- Use the torchaudio ctc_decoder with a phoneme-based lexicon in which more than 6 words share the same phone sequence.
After removing the limit check with the following patch, my word error rate went from 20.3% to 17.9%:
40,46c40,41
< if (node->labels.size() < kTrieMaxLabel) {
< node->labels.push_back(label);
< node->scores.push_back(score);
< } else {
< std::cerr << "[Trie] Trie label number reached limit: " << kTrieMaxLabel
< << "\n";
< }
---
> node->labels.push_back(label);
> node->scores.push_back(score);
Was there a reason why this arbitrary limit was put there in the first place?
Hello, is there still interest in discussing this or getting it fixed? With the proposed fix, the decoder compares really well to our own decoder implementation, and I would like to use it for a scientific publication because it is so simple to use. Currently I provide a patch file with the setup / container image, which works, but I would prefer to have this fixed directly in the repository.
If there is interest I can open the PR, but first I want to clarify whether this limit has a rationale that I do not know about.