Axolotl does not respect `tokenizer.json` and changes BOS, EOS and first token mapping for Yi

Open DreamGenX opened this issue 1 year ago • 1 comment

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I have trained Yi 34B, which uses LlamaTokenizer but has a different pre_processor, post_processor, and decoder. It also uses atypical BOS and EOS tokens.

You can see the difference in the resulting tokenization here:

hello world
01-ai/Yi-34B-200K
['hello', '▁world']
[33228, 1504]
dreamgen/opus-v1-34b
['▁hello', '▁world']
[33653, 1504]
----
<|startoftext|>
01-ai/Yi-34B-200K
['<|startoftext|>']
[1]
dreamgen/opus-v1-34b
['<|startoftext|>']
[64000]
----
<s>
01-ai/Yi-34B-200K
['<', 's', '>']
[59666, 59575, 59644]
dreamgen/opus-v1-34b
['<s>']
[1]
----
<|endoftext|>
01-ai/Yi-34B-200K
['<|endoftext|>']
[2]
dreamgen/opus-v1-34b
['<|endoftext|>']
[64001]
----
</s>
01-ai/Yi-34B-200K
['</', 's', '>']
[1359, 59575, 59644]
dreamgen/opus-v1-34b
['</s>']
[2]
----

The first token of each sequence is tokenized differently because Yi, unlike Llama 2, should not add a virtual space at the start of the sequence.
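A comparison in the same layout as the listing above can be produced with a short script. This is only a minimal sketch: it assumes the `transformers` library and assumes the listing was generated with `tokenize` and `convert_tokens_to_ids`, which the issue does not state. `format_report` and `load_tokenizers` are hypothetical helper names, not part of axolotl.

```python
from typing import Dict, Iterable, List


def format_report(tokenizers: Dict[str, object], texts: Iterable[str]) -> str:
    """Build a per-text, per-tokenizer report of tokens and token ids.

    Each tokenizer only needs `tokenize(text)` and
    `convert_tokens_to_ids(tokens)`, which Hugging Face tokenizers provide.
    """
    lines: List[str] = []
    for text in texts:
        lines.append(text)
        for name, tok in tokenizers.items():
            tokens = tok.tokenize(text)
            ids = tok.convert_tokens_to_ids(tokens)
            lines.append(name)
            lines.append(str(tokens))
            lines.append(str(ids))
        lines.append("----")
    return "\n".join(lines)


def load_tokenizers(names: Iterable[str]) -> Dict[str, object]:
    """Load tokenizers by hub id or local path (requires `transformers`)."""
    # Imported lazily so format_report stays usable without transformers.
    from transformers import AutoTokenizer

    return {name: AutoTokenizer.from_pretrained(name) for name in names}
```

For example, `print(format_report(load_tokenizers(["01-ai/Yi-34B-200K", "dreamgen/opus-v1-34b"]), ["hello world", "<s>", "</s>"]))` would print one block per input string. A local training output directory can be passed in place of a hub id.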

You can see the config difference by comparing these two files: https://huggingface.co/01-ai/Yi-34B-200K/raw/main/tokenizer.json and https://huggingface.co/meta-llama/Llama-2-7b-hf/raw/main/tokenizer.json

Also, this seems to affect the token ids and/or the token id mapping for the BOS and EOS tokens. For Yi, these should be <|startoftext|> and <|endoftext|>, not <s> and </s>.
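The BOS/EOS expectation can be checked directly on a loaded tokenizer. The sketch below is illustrative only: `special_token_mismatches` and `YI_EXPECTED` are hypothetical names, and the expected ids (1 and 2) are taken from the 01-ai/Yi-34B-200K rows of the listing in this issue. It relies on the `bos_token`, `bos_token_id`, `eos_token`, and `eos_token_id` attributes that Hugging Face tokenizers expose.

```python
from typing import Dict, List, Tuple

# Expected (token, id) pairs for base Yi, as shown in the listing in this issue.
YI_EXPECTED: Dict[str, Tuple[str, int]] = {
    "bos": ("<|startoftext|>", 1),
    "eos": ("<|endoftext|>", 2),
}


def special_token_mismatches(tok, expected=YI_EXPECTED) -> List[str]:
    """Return one message per special token that differs from `expected`."""
    problems: List[str] = []
    for kind, (want_token, want_id) in expected.items():
        got_token = getattr(tok, f"{kind}_token")
        got_id = getattr(tok, f"{kind}_token_id")
        if (got_token, got_id) != (want_token, want_id):
            problems.append(
                f"{kind.upper()}: got {got_token!r} (id {got_id}), "
                f"expected {want_token!r} (id {want_id})"
            )
    return problems
```

An empty list means the tokenizer's BOS and EOS match the base Yi values. A tokenizer that reports `<s>` and `</s>` would produce two messages.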

Current behaviour

(Mentioned above)

Steps to reproduce

You can fine-tune Yi 34B for 1 step with no token changes in the yaml, then inspect the generated tokenizer files. You will see that tokenizer.json is not generated. You can then load the tokenizer from the output dir and run the tests above.
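To confirm which tokenizer files a run actually wrote, the output directory can be checked programmatically. This is a rough sketch: `missing_tokenizer_files` is a hypothetical helper, and apart from `tokenizer.json` (named in this issue) the default file names are an assumption based on what Hugging Face tokenizers typically save.

```python
from pathlib import Path
from typing import Iterable, List

# tokenizer.json is the file this issue is about; the other two names are
# assumptions about what a typical Hugging Face tokenizer save produces.
DEFAULT_EXPECTED = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(
    output_dir, expected: Iterable[str] = DEFAULT_EXPECTED
) -> List[str]:
    """Return the expected tokenizer files that are absent from output_dir."""
    directory = Path(output_dir)
    return [name for name in expected if not (directory / name).is_file()]
```

If `tokenizer.json` appears in the returned list for the training output directory, the run did not write it, which is the behaviour described above.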

Config yaml

I had the following in my yaml, but it should not and does not have any influence:

special_tokens:
  additional_special_tokens: ["<|im_start|>", "<|im_end|>"]
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

No response

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

N/A

axolotl branch-commit

I don't have the env anymore, but it should still be reproducible.

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

DreamGenX • Feb 23 '24 09:02

Can you also provide screenshots of the out directory here for reference?

NanoCode012 • Feb 23 '24 18:02