tokenizers
Extra [SEP] added for ModernBERT decoder model
When encoding for https://huggingface.co/jhu-clsp/ettin-decoder-17m with add_special_tokens=True, a [CLS] (bos token) is correctly prepended, but a [SEP] (eos token) is also appended at the end of the sequence, as if it were a BERT encoder. The Python transformers tokenizer pipeline does not add the [SEP].
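A minimal sketch of the comparison being described, assuming the decoder checkpoint named in the report and the standard transformers/tokenizers APIs:

from tokenizers import Tokenizer
from transformers import AutoTokenizer

# transformers fast-tokenizer pipeline: reported NOT to append [SEP]
tok = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-17m")
print(tok.encode_plus("hello 123", add_special_tokens=True).tokens())

# raw tokenizers pipeline: reported to append [SEP] after the sequence
tokenizer = Tokenizer.from_pretrained("jhu-clsp/ettin-decoder-17m")
print(tokenizer.encode("hello 123").tokens)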
Hello,
I'm not one of the maintainers, but I can't seem to reproduce this.
from tokenizers import Tokenizer
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-17m")
print(tok.encode_plus("hello 123").tokens())
# ['[CLS]', 'hello', 'Ġ123', '[SEP]']
tokenizer = Tokenizer.from_pretrained("jhu-clsp/ettin-encoder-17m")
print(tokenizer.encode("hello 123").tokens)
# ['[CLS]', 'hello', 'Ġ123', '[SEP]']
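If the decoder checkpoint really does append [SEP], the likeliest source is the post_processor template shipped in its tokenizer.json, which can be inspected directly (a sketch, assuming huggingface_hub is installed):

import json
from huggingface_hub import hf_hub_download

# fetch only the tokenizer config and print its post-processing template,
# which defines whether special tokens like [SEP] are added around the sequence
path = hf_hub_download("jhu-clsp/ettin-decoder-17m", "tokenizer.json")
with open(path) as f:
    print(json.load(f)["post_processor"])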