tokenizers
Strange warnings with tokenizer for some models
Hello all,
Thank you for your excellent work here! We are using Tokenizer::from_file to load the tokenizer.json file downloaded from the HF Hub. However, it produces many warnings when loading the Phi3 tokenizer:
2024-05-09T12:11:56.647710Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|endoftext|>' was expected to have ID '32000' but was given ID 'None'
2024-05-09T12:11:56.647734Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|assistant|>' was expected to have ID '32001' but was given ID 'None'
2024-05-09T12:11:56.647737Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder1|>' was expected to have ID '32002' but was given ID 'None'
2024-05-09T12:11:56.647739Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder2|>' was expected to have ID '32003' but was given ID 'None'
2024-05-09T12:11:56.647742Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder3|>' was expected to have ID '32004' but was given ID 'None'
2024-05-09T12:11:56.647744Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder4|>' was expected to have ID '32005' but was given ID 'None'
2024-05-09T12:11:56.647746Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|system|>' was expected to have ID '32006' but was given ID 'None'
2024-05-09T12:11:56.647748Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|end|>' was expected to have ID '32007' but was given ID 'None'
2024-05-09T12:11:56.647750Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder5|>' was expected to have ID '32008' but was given ID 'None'
2024-05-09T12:11:56.647752Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder6|>' was expected to have ID '32009' but was given ID 'None'
2024-05-09T12:11:56.647760Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|user|>' was expected to have ID '32010' but was given ID 'None'
I have also noticed this for Phi2 and Llama3, although I see no tokenization errors in the encoded or decoded output.
Is there a way to disable this warning, or am I misconfiguring something? Thank you!