wenet

The space problem in an English ASR model?

Open ziyu123 opened this issue 3 years ago • 2 comments

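I am training an English ASR model with character-level modeling, i.e. dict=['<blank>', '▁', 'a', 'b', 'c', 'd', 'e', ... ,'z','<sos/eos>'], 29 modeling units in total, with the space represented by "▁". After adding LM scoring (the English LM is a word-based 3-gram) and decoding with ctc_decoder, the '▁' symbol disappears. For example, the expected output is "how▁are▁you", but the actual output is "howareyou". Why does the output no longer contain spaces?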

Sep 21 '22 11:09 ziyu123

ctc_decoder needs space_id and blank_id to be set. For your dictionary they should be 1 and 0 respectively. Have you set them?

Sep 22 '22 02:09 yuekaizhang
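
For readers following along, here is a minimal sketch of how the two ids mentioned above fall out of the dictionary given in the question. The variable names `vocab`, `blank_id`, and `space_id` are only illustrative; how they are passed to ctc_decoder is not shown here, since that depends on the decoder's own interface.

```python
import string

# Character dictionary as described in the question: blank, word-boundary
# marker, 26 lowercase letters, and the sos/eos symbol.
vocab = ["<blank>", "▁"] + list(string.ascii_lowercase) + ["<sos/eos>"]

# Look the ids up by symbol instead of hard-coding them, so they stay
# correct if the dictionary order ever changes.
blank_id = vocab.index("<blank>")
space_id = vocab.index("▁")

print(len(vocab), blank_id, space_id)
```

With the dictionary from the question this gives 29 units, with blank_id 0 and space_id 1, matching the values suggested in the reply.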

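"ctc_decoder needs space_id and blank_id to be set. For your dictionary they should be 1 and 0 respectively. Have you set them?"

Yes, both are set. It looks like the space is being dropped when the prefix path is reversed.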

Sep 22 '22 04:09 ziyu123

Please see:

https://github.com/wenet-e2e/wenet/blob/2e7838d0fee36b4f7e7932f1be605d3ec64e7d52/runtime/core/post_processor/post_processor.h#L37

Sep 26 '22 06:09 rookie0607

and also see this:

https://github.com/wenet-e2e/wenet/issues/583#issuecomment-907994058

A simple solution is to set language_type = kIndoEuropean;

Sep 26 '22 09:09 xingchensong

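Thanks to yuekaizhang, rookie0607, and xingchensong for the replies. The problem is solved; for reference, see https://github.com/Slyne/ctc_decoder/issues/9
For now, it appears that ctc_decoder does not support modeling the space with any symbol other than " ".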

Sep 26 '22 15:09 ziyu123
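
Given the conclusion above, one possible workaround is to hand the decoder a vocabulary in which "▁" is swapped for a literal " ", leaving the token ids untouched so the trained model is unaffected. The sketch below is plain Python and makes assumptions: the helper names `vocab_for_decoder` and `restore_spaces` are hypothetical, and it assumes the decoder accepts the vocabulary as a list of strings. It is not taken from ctc_decoder or wenet.

```python
SPACE_MARK = "▁"  # word-boundary symbol used in the training dictionary


def vocab_for_decoder(vocab):
    """Return a copy of vocab with the boundary marker replaced by ' '.

    Positions are preserved, so every token keeps its original id.
    (Hypothetical helper, not part of ctc_decoder.)
    """
    return [" " if tok == SPACE_MARK else tok for tok in vocab]


def restore_spaces(text):
    """Turn any remaining boundary markers into spaces and tidy whitespace.

    (Hypothetical helper for post-processing decoded text.)
    """
    return " ".join(text.replace(SPACE_MARK, " ").split())


decoder_vocab = vocab_for_decoder(["<blank>", "▁", "a", "b"])
text = restore_spaces("how▁are▁you")
```

Here `decoder_vocab` keeps "<blank>" at index 0 and has " " at index 1, and `text` becomes "how are you". Whether this is enough depends on how the decoder and the word-level LM treat the space token, so it should be checked against the linked issue.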