tokenizers

Extra [SEP] added for ModernBERT decoder model

[Open] janimo opened this issue 5 months ago · 1 comment

When encoding for https://huggingface.co/jhu-clsp/ettin-decoder-17m with add_special_tokens=True, a [CLS] (bos token) is correctly prepended, but a [SEP] (eos token) is also appended to the end of the sequence, as if the model were a BERT encoder. The Python transformers tokenizer pipeline does not add the [SEP].

janimo (Jul 19 '25 11:07)

Hello,

I'm not one of the maintainers, but I can't seem to reproduce this.

from tokenizers import Tokenizer
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-17m")
print(tok.encode_plus("hello 123").tokens())
# ['[CLS]', 'hello', 'Ġ123', '[SEP]']
tokenizer = Tokenizer.from_pretrained("jhu-clsp/ettin-encoder-17m")
print(tokenizer.encode("hello 123").tokens)
# ['[CLS]', 'hello', 'Ġ123', '[SEP]']
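
(Note that the snippet above loads the encoder checkpoint, while the report concerns ettin-decoder-17m.) Whether a [SEP] is appended is decided by the tokenizer's post-processor, so the difference between the two checkpoints should come down to their templates. A minimal offline sketch with a toy WordLevel vocab (not the real model's vocab or template) showing how TemplateProcessing controls this:

```python
# Sketch of how a TemplateProcessing post-processor adds (or omits) [SEP].
# Toy vocab for illustration only; the real checkpoints ship their own.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

vocab = {"[CLS]": 0, "[SEP]": 1, "hello": 2, "world": 3, "[UNK]": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Encoder-style template: wraps the sequence in [CLS] ... [SEP].
tok.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 0), ("[SEP]", 1)],
)
print(tok.encode("hello world").tokens)
# ['[CLS]', 'hello', 'world', '[SEP]']

# Decoder-style template: only the bos token is prepended, no [SEP].
tok.post_processor = TemplateProcessing(
    single="[CLS] $A",
    special_tokens=[("[CLS]", 0)],
)
print(tok.encode("hello world").tokens)
# ['[CLS]', 'hello', 'world']
```

So if the decoder checkpoint's tokenizer.json carries the encoder-style template, the Rust library would faithfully append [SEP]; inspecting the `post_processor` section of that file should show which template is actually configured.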

stephantul (Aug 05 '25 12:08)