tokenizers
llama3 tokenizer doesn't round trip
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer("hello !")
{'input_ids': [128000, 15339, 758], 'attention_mask': [1, 1, 1]}
>>> tokenizer.decode([128000, 15339, 758])
'<|begin_of_text|>hello!'
Observe that the input has a space before the ! and the decoded output does not.
This does not reproduce using the upstream llama3 tokenizer.model and tiktoken.
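For reference, a minimal sketch of that round-trip check with tiktoken. The tokenizer.model path and the split regex (copied from Meta's llama3 reference tokenizer) are assumptions on my side, and special tokens are omitted since plain text doesn't need them:

import tiktoken
from tiktoken.load import load_tiktoken_bpe

# Assumed local path to the upstream llama3 tokenizer.model.
ranks = load_tiktoken_bpe("tokenizer.model")
enc = tiktoken.Encoding(
    name="llama3",
    # Split regex as published in Meta's llama3 reference tokenizer.
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=ranks,
    special_tokens={},  # omitted; not needed to tokenize plain text
)
ids = enc.encode("hello !")
assert enc.decode(ids) == "hello !"  # round trips; the space survives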
I think the same issue has been reported before; it comes from the transformers layer's clean_up_tokenization_spaces flag, which post-processes decoded text (e.g. collapsing the space before punctuation). See https://github.com/huggingface/transformers/issues/31187
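If that's the cause, disabling the flag should restore the round trip. A minimal sketch (clean_up_tokenization_spaces can be passed per call to decode, or once at load time):

>>> tokenizer.decode([128000, 15339, 758], clean_up_tokenization_spaces=False)
'<|begin_of_text|>hello !'
>>> # or disable it for all decodes when loading the tokenizer
>>> tokenizer = AutoTokenizer.from_pretrained(
...     "meta-llama/Meta-Llama-3-8B", clean_up_tokenization_spaces=False
... )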
We are gonna deprecate and remove this flag 😉