kaldilm

How to figure out which state represents which history in FST text file?

Open Banaf89 opened this issue 1 year ago • 1 comments

I am using the kaldilm library to convert ARPA files to FST text format in order to build an n-gram language model as an FST. In the ARPA file, every n-gram is listed in full along with its probability. In the FST text format, however, states appear only as numbers. With a small model it is easy to work out which state number corresponds to which n-gram history, but as the model grows it becomes hard to tell which state represents which history (the sequence of preceding words).

One possible solution would be to run a DFS over the graph and label each state from the states and arc labels on the path leading to it, but that would take a long time when the data (and hence the model) is large. It would be much easier if I had a dictionary showing what each state number in the final FST text file represents.
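
To make the traversal idea concrete, here is a minimal sketch of such a labelling pass, written as a breadth-first variant rather than DFS. It is not part of kaldilm: the function name `label_states` and its parameters are hypothetical, and it assumes the file is in OpenFst text format (`src dst ilabel olabel [weight]` for arcs, `state [weight]` for final states) with the usual backoff-LM layout, where the start state stands for the history `<s>` and its backoff arc leads to the empty-history (unigram) state. The backoff label (`#0` here) is also an assumption and should be whatever label the backoff arcs carry in the actual file.

```python
from collections import defaultdict, deque


def label_states(fst_lines, backoff_label="#0", bos="<s>"):
    """Return {state_id: history}, where history is a tuple of arc labels.

    Assumes (not verified against kaldilm) the usual backoff-LM layout:
    the start state has history (bos,), and its backoff arc leads to the
    empty-history (unigram) state.
    """
    arcs = defaultdict(list)  # src state -> [(ilabel, dst state), ...]
    start = None
    for line in fst_lines:
        fields = line.split()
        if not fields:
            continue
        if start is None:
            start = fields[0]  # OpenFst text format: first line names the start state
        if len(fields) >= 4:  # arc line: src dst ilabel olabel [weight]
            arcs[fields[0]].append((fields[2], fields[1]))
        # Lines with 1 or 2 fields are final states and add no arcs.

    if start is None:
        return {}

    history = {start: (bos,)}
    queue = deque()
    # Seed the empty-history state first so shorter histories are visited first.
    for ilabel, dst in arcs[start]:
        if ilabel == backoff_label and dst not in history:
            history[dst] = ()
            queue.append(dst)
    queue.append(start)

    while queue:
        state = queue.popleft()
        for ilabel, dst in arcs[state]:
            if ilabel == backoff_label or dst in history:
                continue  # skip backoff arcs and already-labelled states
            history[dst] = history[state] + (ilabel,)
            queue.append(dst)
    return history
```

Calling `label_states(open(path))` on the FST text file would give the kind of state-to-history dictionary described above. Under the stated assumptions, each state is labelled the first time it is reached and each arc is examined once, so the pass is linear in the size of the file. If the arcs carry integer ids rather than words, the histories will be tuples of ids that still need to be mapped through the symbol table.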

Note: I know there is an argument called --keep-symbols, but it only stores information about the arcs, whereas I want to know what each state represents.

Is there a way to figure out which state represents which history in the FST text file? Thank you for your help.

Banaf89 · Apr 24 '23 09:04