[Bug] Difference in token ids between the Hugging Face tokenization scheme and the llm-export scheme
Hello @wangzhaode, I would like to report an issue I recently discovered. When a SentencePiece model is used for the tokenizer (as in Llama 2), llm-export replaces the "▁" prefix with " " for every token where it is present (https://github.com/wangzhaode/llm-export/blob/185fcabc00b0aa724fe037582c4d90013596f11c/llm_export.py#L363C16-L363C65). The modified tokens are encoded and saved to tokenizer.txt, then decoded and used during MNN-LLM inference.
The vocabulary can contain tokens like "▁The" alongside "The", so after export "▁The" becomes " The" while "The" remains "The", as sketched below.
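Here is a minimal sketch of that export step (the vocabulary and names below are illustrative, not the actual llm_export.py code):

```python
# Toy illustration of the export behavior described above: each
# SentencePiece piece has its word-boundary marker "▁" (U+2581)
# replaced by an ordinary space before being written to tokenizer.txt.
vocab = ["▁The", "The", "▁train", "!"]  # illustrative vocabulary entries
exported = [piece.replace("\u2581", " ") for piece in vocab]
print(exported)  # [' The', 'The', ' train', '!']
```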
Now, if I encode the same prompt and print the ids using the Hugging Face AutoTokenizer and the MNN tokenizer, the results are as follows.
For the prompt "The train is moving very fast!", the Hugging Face tokenizer returns:
**prompt encode to ids:**
[1, 450, 7945, 338, 8401, 1407, 5172, 29991]
**ids decode to string again:**
<s> The train is moving very fast!
The MNN implementation of the tokenizer returns:
**prompt encode to ids:**
[1, 1576, 7945, 338, 8401, 1407, 5172, 29991]
**ids decode to string again:**
<s>The train is moving very fast!
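The Hugging Face side can be reproduced with a few lines (a minimal sketch, assuming the meta-llama/Llama-2-7b-hf checkpoint; any SentencePiece-based Llama 2 tokenizer should behave the same):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "The train is moving very fast!"
ids = tokenizer.encode(prompt)   # BOS (id 1) is added by default
print(ids)                       # [1, 450, 7945, 338, 8401, 1407, 5172, 29991]
print(tokenizer.decode(ids))     # <s> The train is moving very fast!
```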
Please notice how Hugging Face adds a blank space " " before "The" when decoding the ids back to a string. This affects output generation, since the first token id differs between the two. I am not sure which approach is correct, Hugging Face's or MNN's.
I feel that the input prompt only contained "The" and not " The", so the MNN approach seems better, but Hugging Face has been the standard as well.
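For what it's worth, my understanding is that the difference comes from SentencePiece's dummy-prefix behavior: Hugging Face prepends an implicit "▁" to the input, so the leading "The" maps to the word-initial piece "▁The" (id 450), while the MNN tokenizer matches the raw prompt text and finds the bare piece "The" (id 1576). Both ids exist as distinct vocabulary entries, which can be checked quickly (again assuming the meta-llama/Llama-2-7b-hf tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Ids consistent with the outputs printed above: 450 is the
# word-initial piece "▁The", 1576 is the bare piece "The".
print(tokenizer.convert_ids_to_tokens([450, 1576]))  # ['▁The', 'The']
```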
Could you share some information and your opinion?
Thanks a lot