probs_seq[i].size() not equal to vocabulary.size()

Open lzj9072 opened this issue 5 years ago • 2 comments

I double-checked the sizes and read the source of ctc_beam_decoder.cpp, and I finally found out why this occurs. My modeling units are English words rather than English letters. The Python code concatenates the vocabulary into one long string (''.join(vocab)) and passes it to the C++ code (const char* labels). So if the vocabulary is ["Hello", "World"], it actually becomes ["H", "e", "l", "l", "o", "W", "o", "r", "l", "d"]: 10 labels instead of 2, which no longer matches the width of probs_seq[i]. Is there a solution for different modeling units?
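
A minimal sketch of the mismatch in plain Python, without importing ctcdecode. It assumes, as described above, that the receiving side reads the joined string back one character at a time; the variable names are only illustrative.

```python
# Word-level vocabulary, as in the example above.
vocab = ["Hello", "World"]

# Joining the labels into a single string drops the word boundaries.
labels = "".join(vocab)

# Assumption: the consumer splits the string per character.
recovered = list(labels)

# Each row of probs_seq has one probability per modeling unit (2 here),
# but the recovered vocabulary has one entry per character (10 here).
num_classes = len(vocab)
print(labels, num_classes, len(recovered))
```

With single-character units the two counts are equal, which is presumably why the problem only shows up with word-level or subword vocabularies.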

lzj9072 avatar Dec 11 '19 04:12 lzj9072

Hey, I have the same situation. Did you fix the problem?

faresbs avatar Jan 12 '20 20:01 faresbs

@lzj9072 I ran into the same situation and made some changes to the source code. Everything works after testing, and it now supports different modeling units.

This is my repository:

https://github.com/PanXiebit/ctcdecode

The differences between my code and the original source are in this commit:

https://github.com/PanXiebit/ctcdecode/commit/a604c93866fb76f0d2e783f78485081b0a943dbf

PanXiebit avatar Jan 15 '20 15:01 PanXiebit