
llama3 tokenizer doesn't round trip

Open josharian opened this issue 1 year ago • 3 comments

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer("hello !")
{'input_ids': [128000, 15339, 758], 'attention_mask': [1, 1, 1]}
>>> tokenizer.decode([128000, 15339, 758])
'<|begin_of_text|>hello!'

Observe that the input has a space before the `!` but the decoded output does not, so encode followed by decode does not round-trip the original string.

josharian avatar Jun 03 '24 22:06 josharian

This does not reproduce using the upstream llama3 tokenizer.model and tiktoken.

josharian avatar Jun 03 '24 23:06 josharian

I think the same issue was reported before; it is caused by the transformers layer's `clean_up_tokenization_spaces`, which strips spaces before punctuation during decoding. See this: https://github.com/huggingface/transformers/issues/31187

ArthurZucker avatar Jun 05 '24 07:06 ArthurZucker
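To illustrate the comment above: the lost space is not a tokenizer bug but a post-processing step applied during decode. The sketch below is a hedged, minimal re-implementation of that cleanup (the replacement list is an assumption about the library's internals, not a verbatim copy) showing why `"hello !"` comes back as `"hello!"`:

```python
def clean_up_tokenization(text: str) -> str:
    """Sketch of the decode-time cleanup that removes spaces before
    punctuation and contractions. The exact replacement list here is
    an assumption for illustration."""
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
        (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

print(clean_up_tokenization("hello !"))  # -> hello!
```

If you need decode to preserve the original spacing, passing `clean_up_tokenization_spaces=False` to `tokenizer.decode(...)` disables this step.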

We are gonna deprecate and remove this flag 😉

ArthurZucker avatar Jun 05 '24 07:06 ArthurZucker

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 06 '24 01:07 github-actions[bot]