
[meloform] Why are all token ids in the dictionary 0?

Open · punkcure opened this issue 2 years ago · 1 comment

I noticed that the gen_dictionary function uses the variable 'num' to represent each token's id, but 'num' stays 0 for the whole process, so every token in the dictionary is written with 0. Is this a bug, or is it intentional?

punkcure · Dec 02 '22 07:12

Actually, those values are not the token ids. This dictionary file is written in the format the fairseq framework expects. You can refer to https://github.com/facebookresearch/fairseq/blob/f131336fc303992cf309be3953bf523e1654fa1f/fairseq/data/dictionary.py#L125 for how fairseq loads the dictionary, especially the add_symbol() function. The variable "num" is just the initial count of each token. The id itself is assigned by fairseq when the file is loaded, based on the order in which the tokens are added, so writing 0 for every count is harmless.
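To make the loading behaviour concrete, here is a minimal sketch that imitates it without importing fairseq. `MiniDictionary` and the token names are hypothetical, invented for illustration, and the class is a simplified stand-in (it skips the duplicate-entry checks the real loader performs). It assumes each dictionary line has the form `<token> <count>` and that the first four ids are reserved for fairseq's default special symbols.

```python
class MiniDictionary:
    """Simplified, hypothetical stand-in for fairseq's Dictionary."""

    def __init__(self):
        self.symbols = []   # id -> token
        self.count = []     # id -> count
        self.indices = {}   # token -> id
        # Assumption: special symbols occupy the first ids, as in fairseq's defaults.
        for special in ("<s>", "<pad>", "</s>", "<unk>"):
            self.add_symbol(special)

    def add_symbol(self, word, n=1):
        """Add a token; its id is its insertion position, n only affects the count."""
        if word in self.indices:
            idx = self.indices[word]
            self.count[idx] += n
            return idx
        idx = len(self.symbols)
        self.indices[word] = idx
        self.symbols.append(word)
        self.count.append(n)
        return idx

    def add_from_lines(self, lines):
        """Parse '<token> <count>' lines, as found in a dictionary file."""
        for line in lines:
            word, field = line.rstrip().rsplit(" ", 1)
            self.add_symbol(word, n=int(field))


# Hypothetical dictionary content: every count is 0, like the file in question.
dict_lines = ["bar_1 0", "pos_0 0", "pitch_60 0"]

d = MiniDictionary()
d.add_from_lines(dict_lines)
print(d.indices["bar_1"], d.indices["pos_0"], d.indices["pitch_60"])  # 4 5 6
print(d.count[4:])  # [0, 0, 0]
```

Even though every line carries 0, each token ends up with a distinct id, because the second column only feeds the count.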

peillu · Dec 09 '22 05:12