Incorrect offsets after replace with special token
I'm not sure if this is related to https://github.com/huggingface/tokenizers/issues/892 - the code below replaces digit runs with the special <digits> token. The tokens and ids are correct, but the offsets of the <digits> token are off by one. I don't believe this is a general problem with normalizers.Replace: replacing punctuation with " " works, as does introducing whitespace via a zero-length regex split (i.e. (?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z]) for camel case). The issue only seems to occur with the special token, where the replaced text is longer than one character (I suspect it's really just the latter).
The error seems to be related to the length of the replaced string: if the length is one (e.g. EX1) the offset is correct (2,3); if two (e.g. EX12) the offsets are off by one (3,4); if three (e.g. EX123) the offsets are off by two (4,5); and so on.
EDIT: I've confirmed that the issue is with Replace matching more than one character rather than with what it inserts - if I replace digit runs (\d+) with the single character "0" I get the same incorrect offsets, whereas replacing a single digit (\d) with "0" gives correct offsets.
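A quick way to see that the replacement text itself comes out right in both cases (a minimal sketch using normalize_str, which returns only the normalized string, so offsets never come into play):

import tokenizers

# per-digit replacement - offsets downstream are correct
one = tokenizers.normalizers.Replace(tokenizers.Regex(r"\d"), "0")
# whole-run replacement - offsets downstream are wrong
many = tokenizers.normalizers.Replace(tokenizers.Regex(r"\d+"), "0")

print(one.normalize_str("EX 12"))   # EX 00
print(many.normalize_str("EX 12"))  # EX 0

The full repro, including offsets, is below: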
import tokenizers
import string
UNK = "<unk>"
DIGITS = "<digits>"
special_tokens = [UNK, DIGITS]
tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE(unk_token=UNK))
tokenizer.normalizer = tokenizers.normalizers.Replace(tokenizers.Regex(r"\d+"), DIGITS)
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
    tokenizers.pre_tokenizers.Split(DIGITS, "isolated", False),
    tokenizers.pre_tokenizers.WhitespaceSplit(),
])
trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=1000,
    special_tokens=special_tokens,
    initial_alphabet=string.ascii_lowercase,
)
tokenizer.train_from_iterator(['EX 01']*100, trainer=trainer)
text = "EX 12"
for offset in tokenizer.encode(text).offsets:
    print(text[offset[0]:offset[1]])
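For reference, with correct offsets this loop would print EX and then 12; given the behavior described above, the <digits> offsets cover only the last digit, so the second line comes out as 2.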
Entirely correct!
I didn't pinpoint the issue yet, but it seems to just output the offsets of the last digit regardless of how many digits there are in the string, so only the last digit gets printed (s/12/123456/g to test).
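A quick check with a longer digit run (a sketch reusing the tokenizer trained above; the comment assumes the last-digit behavior described here):

text = "EX 123456"
for start, end in tokenizer.encode(text).offsets:
    # with correct offsets the second iteration would print 123456;
    # per the diagnosis above, it prints only the last digit, 6
    print(text[start:end])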