Fix tokenizer.encode() to respect add_special_tokens=False parameter

Open ved1beta opened this issue 8 months ago • 0 comments

Problem

The add_special_tokens=False parameter in the tokenizer's encode/encode_batch methods doesn't work as expected. Even when set to False, special tokens (like EOS) are still being added to the encoded output.

Root Cause

The issue occurs because the add_special_tokens parameter is not being passed to the base tokenizer's encode_batch method. While our code correctly handles the parameter after encoding, by that point the base tokenizer may have already added special tokens.

Solution

This PR adds a check to see whether the base tokenizer's encode_batch method accepts the add_special_tokens parameter. If it does, the parameter is passed through, so the base tokenizer does not add special tokens when add_special_tokens=False. If it does not, encode_batch is called as before, which preserves backward compatibility with tokenizers that don't support this parameter.
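A minimal sketch of such a capability check, assuming it is done with signature introspection. The helper names (supports_add_special_tokens, encode_batch_compat) and the two stand-in tokenizer classes are hypothetical illustrations, not code taken from this PR's diff:

```python
import inspect


def supports_add_special_tokens(method) -> bool:
    """Return True if `method` accepts an `add_special_tokens` keyword."""
    try:
        params = inspect.signature(method).parameters
    except (TypeError, ValueError):
        # Some compiled callables expose no signature; treat as unsupported.
        return False
    if "add_special_tokens" in params:
        return True
    # A **kwargs catch-all would also accept the keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def encode_batch_compat(base_tokenizer, inputs, add_special_tokens=True):
    """Forward `add_special_tokens` only when the base tokenizer accepts it."""
    if supports_add_special_tokens(base_tokenizer.encode_batch):
        return base_tokenizer.encode_batch(
            inputs, add_special_tokens=add_special_tokens
        )
    return base_tokenizer.encode_batch(inputs)


class NewStyleTokenizer:
    """Hypothetical base tokenizer that appends EOS unless told not to."""

    EOS = 0

    def encode_batch(self, inputs, add_special_tokens=True):
        batch = [[ord(ch) for ch in text] for text in inputs]
        if add_special_tokens:
            return [ids + [self.EOS] for ids in batch]
        return batch


class OldStyleTokenizer:
    """Hypothetical base tokenizer whose encode_batch lacks the parameter."""

    def encode_batch(self, inputs):
        return [[ord(ch) for ch in text] for text in inputs]
```

With the stand-ins above, encode_batch_compat(NewStyleTokenizer(), ["ab"], add_special_tokens=False) returns only the content ids, while the same call on OldStyleTokenizer falls back to the plain encode_batch(inputs) call. Whether introspection succeeds on a compiled tokenizer depends on whether it exposes a signature, so the fallback branch in this sketch errs on the side of not passing the keyword.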

Testing

I've verified the fix by testing encodings with and without special tokens:

  • add_special_tokens=False now correctly returns only content tokens without the EOS token
  • add_special_tokens=True continues to work as before, returning content tokens plus the EOS token
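The two checks above could be written as a small reusable assertion, sketched here under assumptions: check_special_tokens, toy_encode, and TOY_EOS_ID are hypothetical names, and the toy encoder only stands in for the real tokenizer's encode method so the example runs on its own.

```python
TOY_EOS_ID = 0  # hypothetical EOS id used only by the toy encoder below


def toy_encode(text, add_special_tokens=True):
    """Stand-in for tokenizer.encode(): code points, plus EOS if requested."""
    ids = [ord(ch) for ch in text]
    if add_special_tokens:
        ids.append(TOY_EOS_ID)
    return ids


def check_special_tokens(encode, text, eos_token_id):
    """Assert the behaviour listed above for any encode-like callable."""
    without = encode(text, add_special_tokens=False)
    with_special = encode(text, add_special_tokens=True)
    # add_special_tokens=False: content tokens only, no trailing EOS.
    assert not without or without[-1] != eos_token_id
    # add_special_tokens=True: the same content tokens plus one EOS token.
    assert with_special == without + [eos_token_id]
    return without, with_special


content_ids, full_ids = check_special_tokens(toy_encode, "hello", TOY_EOS_ID)
```

Passing the real tokenizer's encode method and its EOS id in place of the toy values would exercise the same two conditions against the actual fix.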

Fixes #765

ved1beta, Apr 07 '25 11:04