athena-decoder

<unk> and <eps>

Open. chenguoguo opened this issue 5 years ago • 1 comment

Hey guys, I finally got some spare time to look into this now. Thanks a lot for putting this together!

I'm looking at the symbol tables for words and characters. I noticed that 0 is reserved for <eps> in words.txt, but is used for <unk> in characters.txt. As a result, in the resulting SG.fst graph, the output side has separate <eps> and <unk> symbols, while the input side has a single symbol that mixes <eps> and <unk>. This is because OpenFST treats 0 as epsilon in all algorithms by default.

Shall we reserve 0 for <eps> as long as OpenFST is involved? This requires changes to both Athena and Athena-decoder. Correct me if I'm wrong though. @tjadamlee @godjealous

chenguoguo, Jan 28 '20 03:01
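
For readers who want to check their own symbol tables for the conflict described above, here is a minimal sketch. It assumes the usual OpenFST text format of one `symbol id` pair per line; the helper names `load_symbol_table` and `zero_is_epsilon` and the sample table contents are hypothetical and are not taken from Athena or athena-decoder.

```python
def load_symbol_table(text):
    """Parse 'symbol id' lines into a dict mapping id -> symbol."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        symbol, idx = parts[0], int(parts[1])
        table[idx] = symbol
    return table


def zero_is_epsilon(table, eps="<eps>"):
    """Return True when id 0 is mapped to the epsilon symbol."""
    return table.get(0) == eps


# Hypothetical table contents, for illustration only.
words = load_symbol_table("<eps> 0\n<unk> 1\nhello 2\n")
chars = load_symbol_table("<unk> 0\na 1\nb 2\n")

print(zero_is_epsilon(words))  # True
print(zero_is_epsilon(chars))  # False
```

A table that fails this check would have its id-0 symbol silently treated as epsilon by OpenFST operations such as composition.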

Thanks for your interest in the athena-decoder project.

Actually, we always reserve 0 for epsilon on both the input side and the output side of the WFST. As you mentioned, symbol 0 is reserved in words.txt. Symbol 0 is also reserved in characters_disambig.txt.

The input symbol table for the SG.fst graph is "characters_disambig.txt" rather than "characters.txt". The output symbol table for the SG.fst graph is "words.txt".

Compared with "characters.txt", "characters_disambig.txt" contains some extra entries, including the epsilon symbol and some disambiguation symbols.

godjealous, Feb 06 '20 10:02
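
To illustrate the relationship between the two character tables described in the reply, here is a rough sketch of how a table with epsilon at id 0 and trailing disambiguation symbols could be derived from a plain character list. The thread does not show the actual layout of characters_disambig.txt or the script that produces it, so everything here is an assumption: the function name `add_eps_and_disambig` is hypothetical, and the `#0`, `#1` naming follows the Kaldi convention rather than anything confirmed for athena-decoder.

```python
def add_eps_and_disambig(chars, num_disambig=2, eps="<eps>"):
    """Return (symbol, id) pairs with eps at id 0 and disambiguation symbols last.

    chars: symbols in their original order, without ids.
    num_disambig: how many '#N' symbols to append (assumed naming).
    """
    # Put epsilon first so that it receives id 0, then the original symbols.
    symbols = [eps] + [c for c in chars if c != eps]
    # Append disambiguation symbols '#0', '#1', ... after the real characters.
    symbols += [f"#{i}" for i in range(num_disambig)]
    return [(sym, idx) for idx, sym in enumerate(symbols)]


# Hypothetical character list, for illustration only.
for sym, idx in add_eps_and_disambig(["<unk>", "a", "b"]):
    print(sym, idx)
```

With this layout, <unk> moves to id 1, so id 0 is free to act as epsilon on the input side of the graph.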