
Issue with tokenizer wrapper

Open davidbrandfonbrener opened this issue 7 months ago • 0 comments

❓ The question

The tokenizer wrapper causes unintended behavior when the tokenizer has a BOS token (like the Llama tokenizers). In particular, the call to the base_tokenizer encode function will add BOS tokens even when the wrapper is called with add_special_tokens=False.

The issue is that the base_tokenizer's encode defaults to add_special_tokens=True, and the wrapper never forwards the flag.

This should be fairly easy to fix, but to properly handle tokenizers with bos tokens, the wrapper would need to be changed more broadly.
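To make the failure mode concrete, here is a minimal sketch using toy stand-ins (ToyLlamaTokenizer, BuggyWrapper, and FixedWrapper are hypothetical names, not the actual OLMo code): the base tokenizer defaults to add_special_tokens=True, so a wrapper that delegates without forwarding the flag silently gets a BOS token prepended.

```python
BOS_ID = 1  # placeholder BOS token id

class ToyLlamaTokenizer:
    """Mimics a Llama-style tokenizer whose encode() defaults to add_special_tokens=True."""
    def encode(self, text, add_special_tokens=True):
        ids = [ord(c) for c in text]  # fake token ids for illustration
        if add_special_tokens:
            ids = [BOS_ID] + ids  # BOS is prepended by default
        return ids

class BuggyWrapper:
    """Reproduces the reported bug: the flag is dropped when delegating."""
    def __init__(self, base_tokenizer):
        self.base_tokenizer = base_tokenizer

    def encode(self, text, add_special_tokens=False):
        # Bug: add_special_tokens is never forwarded, so the
        # base tokenizer's default (True) wins and BOS leaks in.
        return self.base_tokenizer.encode(text)

class FixedWrapper(BuggyWrapper):
    """One possible fix: forward the flag explicitly."""
    def encode(self, text, add_special_tokens=False):
        return self.base_tokenizer.encode(text, add_special_tokens=add_special_tokens)

base = ToyLlamaTokenizer()
print(BuggyWrapper(base).encode("hi", add_special_tokens=False))  # BOS still present
print(FixedWrapper(base).encode("hi", add_special_tokens=False))  # no BOS
```

This only captures the narrow flag-forwarding bug; as noted above, fully supporting BOS-bearing tokenizers would require broader changes to the wrapper.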

This also raised a question for me: why is this wrapper needed in the first place, instead of using the Hugging Face tokenizers library directly? I wanted to better understand the motivation before making changes.

davidbrandfonbrener · Jul 08 '24 18:07