
Strange warnings with tokenizer for some models

Open EricLBuehler opened this issue 1 year ago • 0 comments

Hello all,

Thank you for your excellent work here! We are using Tokenizer::from_file to load the tokenizer.json file from the HF Hub. However, loading the Phi3 tokenizer produces many warnings:

2024-05-09T12:11:56.647710Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|endoftext|>' was expected to have ID '32000' but was given ID 'None'    
2024-05-09T12:11:56.647734Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|assistant|>' was expected to have ID '32001' but was given ID 'None'    
2024-05-09T12:11:56.647737Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder1|>' was expected to have ID '32002' but was given ID 'None'    
2024-05-09T12:11:56.647739Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder2|>' was expected to have ID '32003' but was given ID 'None'    
2024-05-09T12:11:56.647742Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder3|>' was expected to have ID '32004' but was given ID 'None'    
2024-05-09T12:11:56.647744Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder4|>' was expected to have ID '32005' but was given ID 'None'    
2024-05-09T12:11:56.647746Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|system|>' was expected to have ID '32006' but was given ID 'None'    
2024-05-09T12:11:56.647748Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|end|>' was expected to have ID '32007' but was given ID 'None'    
2024-05-09T12:11:56.647750Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder5|>' was expected to have ID '32008' but was given ID 'None'    
2024-05-09T12:11:56.647752Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder6|>' was expected to have ID '32009' but was given ID 'None'    
2024-05-09T12:11:56.647760Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|user|>' was expected to have ID '32010' but was given ID 'None'    
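
For triage, here is a minimal std-only sketch that pulls the token, the expected ID, and the reported ID out of one of the warning lines above. The names IdWarning, quoted_after, and parse_id_warning are hypothetical helpers written for this post, not part of the tokenizers crate, and the parser assumes the exact message wording shown in the log (it would not handle a token that itself contains a single quote).

```rust
/// One parsed warning line (hypothetical helper type, not from the tokenizers crate).
#[derive(Debug, PartialEq)]
struct IdWarning {
    token: String,
    expected: u32,
    /// `None` when the log printed the literal text `None`.
    given: Option<u32>,
}

/// Returns the text between `marker` and the next single quote after it.
fn quoted_after<'a>(line: &'a str, marker: &str) -> Option<&'a str> {
    let start = line.find(marker)? + marker.len();
    let rest = &line[start..];
    let end = rest.find('\'')?;
    Some(&rest[..end])
}

/// Parses a line of the form seen in the log above:
/// "... Token 'X' was expected to have ID 'N' but was given ID 'M'".
/// Returns `None` for lines that do not match that wording.
fn parse_id_warning(line: &str) -> Option<IdWarning> {
    let token = quoted_after(line, "Token '")?;
    let expected = quoted_after(line, "expected to have ID '")?.parse().ok()?;
    let given_raw = quoted_after(line, "was given ID '")?;
    let given = if given_raw == "None" {
        None
    } else {
        Some(given_raw.parse().ok()?)
    };
    Some(IdWarning {
        token: token.to_string(),
        expected,
        given,
    })
}
```

Feeding each captured log line to parse_id_warning and collecting the results gives a list of affected tokens that can be compared against the added_tokens entries in tokenizer.json.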

I have also noticed this for Phi2 and Llama3, although I see no tokenization errors in the encoded or decoded output.

Is there a way to disable this warning, or am I misconfiguring something? Thank you!

EricLBuehler, May 09 '24 18:05