[Bug] Difference in token ids between the Hugging Face tokenization scheme and the llm-export scheme
Hello @wangzhaode, I would like to report an issue I recently discovered. When a SentencePiece model is used for the tokenizer (as in Llama 2), llm-export replaces the "▁" prefix with " " for every token where it is present (https://github.com/wangzhaode/llm-export/blob/185fcabc00b0aa724fe037582c4d90013596f11c/llm_export.py#L363C16-L363C65). The modified tokens are encoded and saved to tokenizer.txt, then decoded and used during MNN-LLM inference.
The vocabulary can contain tokens like "▁The" alongside "The", so after export "▁The" becomes " The" while "The" remains "The", as sketched below.
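Here is a minimal sketch of that export step (the vocabulary and names below are illustrative, not the actual llm_export.py code):

```python
# Toy illustration of the export behavior described above: each
# SentencePiece piece has its word-boundary marker "▁" (U+2581)
# replaced by an ordinary space before being written to tokenizer.txt.
vocab = ["▁The", "The", "▁train", "!"]  # illustrative vocabulary entries
exported = [piece.replace("\u2581", " ") for piece in vocab]
print(exported)  # [' The', 'The', ' train', '!']
```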
Now, if I encode the same prompt and print the ids using the Hugging Face AutoTokenizer and the MNN tokenizer, the results are as follows.
For the prompt "The train is moving very fast!", the Hugging Face tokenizer returns:
**prompt encode to ids:**
[1, 450, 7945, 338, 8401, 1407, 5172, 29991]
**ids decode to string again:**
<s> The train is moving very fast!
The MNN implementation of the tokenizer returns:
**prompt encode to ids:**
[1, 1576, 7945, 338, 8401, 1407, 5172, 29991]
**ids decode to string again:**
<s>The train is moving very fast!
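The Hugging Face side can be reproduced with a few lines (a minimal sketch, assuming the meta-llama/Llama-2-7b-hf checkpoint; any SentencePiece-based Llama 2 tokenizer should behave the same):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "The train is moving very fast!"
ids = tokenizer.encode(prompt)   # BOS (id 1) is added by default
print(ids)                       # [1, 450, 7945, 338, 8401, 1407, 5172, 29991]
print(tokenizer.decode(ids))     # <s> The train is moving very fast!
```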
Please notice how Hugging Face adds a blank space " " before "The" when decoding the ids back to a string. This affects output generation, since the first token id differs between the two. I am not sure which approach is correct, Hugging Face's or MNN's.
I feel that the input prompt only contained "The" and not " The", so the MNN approach seems better, but Hugging Face has been the standard as well.
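For what it's worth, my understanding is that the difference comes from SentencePiece's dummy-prefix behavior: Hugging Face prepends an implicit "▁" to the input, so the leading "The" maps to the word-initial piece "▁The" (id 450), while the MNN tokenizer matches the raw prompt text and finds the bare piece "The" (id 1576). Both ids exist as distinct vocabulary entries, which can be checked quickly (again assuming the meta-llama/Llama-2-7b-hf tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Ids consistent with the outputs printed above: 450 is the
# word-initial piece "▁The", 1576 is the bare piece "The".
print(tokenizer.convert_ids_to_tokens([450, 1576]))  # ['▁The', 'The']
```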
Could you share some information and your opinion?
Thanks a lot