
Bug: is_pretokenized is not used when calling tokenizer.encode(...)

Open · jannessm opened this issue 1 year ago · 0 comments

is_pretokenized doesn't seem to be respected in some cases. The same code given below works in 0.20.0.

Code

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordPiece

# Vocabulary intentionally contains no [UNK] token.
m = WordPiece({'F': 0, '<eos>': 1})
t = Tokenizer(m)
# Intended as a character-level split (empty pattern, 'isolated' behavior).
t.pre_tokenizer = pre_tokenizers.Split('', 'isolated')

# Input is already pre-tokenized, so the pre_tokenizer should not be applied.
t.encode(['<eos>'], is_pretokenized=True).ids

Expected to run without any issue, but it raises the following exception:

Exception: WordPiece error: Missing [UNK] token from the vocabulary

It seems the is_pretokenized flag is ignored and the pre_tokenizer is still applied to the <eos> token; the resulting pieces are not in the vocabulary, so WordPiece falls back to the missing [UNK] token and raises.
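
One possible way to confirm this (not part of the original report; the [UNK] entry below is an assumption added only for this check) is to give WordPiece an unk_token so the call can complete, then inspect the output:

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordPiece

# Same setup as above, plus a [UNK] entry so WordPiece has a fallback.
m = WordPiece({'F': 0, '<eos>': 1, '[UNK]': 2}, unk_token='[UNK]')
t = Tokenizer(m)
t.pre_tokenizer = pre_tokenizers.Split('', 'isolated')

enc = t.encode(['<eos>'], is_pretokenized=True)
# If is_pretokenized were honored, this should print a single id (1) for
# '<eos>'; several [UNK] ids (2) instead would indicate the pre_tokenizer
# split the token into out-of-vocabulary pieces.
print(enc.ids, enc.tokens)

Running the same snippet on 0.20.0 and on the current release should make the difference in behavior visible.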

jannessm · Nov 29, 2024