athena-decoder
<unk> and <eps>
Hey guys, I finally got some spare time to look into this. Thanks a lot for putting this together!
I'm looking at the symbol tables for words and characters. I noticed that 0 is reserved for <eps> in words.txt, but is used for <unk> in characters.txt. As a result, in the resulting SG.fst graph, the output side has separate <eps> and <unk> symbols, while the input side has a single symbol that mixes <eps> and <unk>. This is because OpenFST treats 0 as epsilon in all algorithms by default.
Shall we reserve 0 for <eps> as long as OpenFST is involved? This would require changes to both Athena and Athena-decoder. Correct me if I'm wrong though. @tjadamlee @godjealous
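For anyone who wants to check their own tables, here is a minimal sketch of the check described above. It assumes the usual OpenFST text format of one "symbol id" pair per line. The table contents and the helper names `load_symbol_table` and `zero_is_epsilon` are hypothetical, made up for illustration; they are not part of Athena or athena-decoder.

```python
def load_symbol_table(text):
    """Parse symbol-table text with one 'symbol id' pair per line."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        symbol, index = parts[0], int(parts[1])
        table[index] = symbol
    return table


def zero_is_epsilon(table, eps="<eps>"):
    """Return True if id 0 is mapped to the epsilon symbol."""
    return table.get(0) == eps


# Hypothetical table contents, for illustration only.
words_txt = "<eps> 0\n<unk> 1\nhello 2\n"
characters_txt = "<unk> 0\na 1\nb 2\n"

words = load_symbol_table(words_txt)
chars = load_symbol_table(characters_txt)

print(zero_is_epsilon(words))  # True
print(zero_is_epsilon(chars))  # False
```

A table that fails this check would have its id-0 symbol treated as epsilon by OpenFST, whatever the table calls it.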
Thanks for your interest in the athena-decoder project.
Actually, we always reserve 0 for epsilon on both the input side and the output side of the WFST. As you mentioned, symbol 0 is reserved in the file words.txt. Symbol 0 is also reserved in the file characters_disambig.txt.
The input symbol table for the SG.fst graph is the file "characters_disambig.txt" rather than the file "characters.txt". The output symbol table for the SG.fst graph is the file "words.txt".
Compared with the file "characters.txt", the file "characters_disambig.txt" contains some extra entries, including the epsilon symbol and some disambiguation symbols.
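To illustrate the relationship between the two files, here is a rough sketch of how a table like "characters_disambig.txt" could be derived from a plain character list. This is not the actual athena-decoder code: the helper `add_eps_and_disambig` is hypothetical, the "#k" naming for disambiguation symbols follows the Kaldi convention, and shifting the original ids up by one is an assumption.

```python
def add_eps_and_disambig(symbols, num_disambig, eps="<eps>"):
    """Build (symbol, id) pairs: eps at 0, original symbols, then '#k' symbols."""
    ordered = [eps] + [s for s in symbols if s != eps]
    # Append disambiguation symbols after all ordinary symbols.
    ordered += ["#%d" % k for k in range(num_disambig)]
    return [(symbol, index) for index, symbol in enumerate(ordered)]


# Hypothetical character list, for illustration only.
chars = ["<unk>", "a", "b"]
table = add_eps_and_disambig(chars, num_disambig=2)

for symbol, index in table:
    print(symbol, index)
```

With epsilon placed at 0 this way, <unk> gets its own non-zero id on the input side and no longer collides with epsilon.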