Incorrect offsets after replace with special token
I'm not sure if this is related to https://github.com/huggingface/tokenizers/issues/892 - the code below replaces digit runs with the special <digits> token. The tokens and ids are correct, but the offsets of the <digits> token are off by one. I don't believe this is a general problem with normalizers.Replace: replacing punctuation with " " works, as does introducing whitespace via a zero-length regex split (i.e. (?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z]) for camel case). The issue only seems to occur with the special token, where the replaced text is longer than one character (I suspect it's really just the latter).
The error seems to be related to the length of the replaced string: if the length is one (e.g. EX1) the offset is correct (2,3); if two (e.g. EX12) the offsets are off by one (3,4); if three (e.g. EX123) the offsets are off by two (4,5); and so on.
EDIT: I've confirmed that the issue is with Replace matching more than one character rather than with what it inserts - if I replace digit runs (\d+) with the single character "0" I get the same incorrect offsets, whereas replacing a single digit (\d) with "0" gives correct offsets.
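A quick way to see that the replacement text itself comes out right in both cases (a minimal sketch using normalize_str, which returns only the normalized string, so offsets never come into play):

import tokenizers

# per-digit replacement - offsets downstream are correct
one = tokenizers.normalizers.Replace(tokenizers.Regex(r"\d"), "0")
# whole-run replacement - offsets downstream are wrong
many = tokenizers.normalizers.Replace(tokenizers.Regex(r"\d+"), "0")

print(one.normalize_str("EX 12"))   # EX 00
print(many.normalize_str("EX 12"))  # EX 0

The full repro, including offsets, is below: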
import tokenizers
import string
UNK = "<unk>"
DIGITS = "<digits>"
special_tokens = [UNK, DIGITS]
tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE(unk_token=UNK))
tokenizer.normalizer = tokenizers.normalizers.Replace(tokenizers.Regex(r"\d+"), DIGITS)
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
    tokenizers.pre_tokenizers.Split(DIGITS, "isolated", False),
    tokenizers.pre_tokenizers.WhitespaceSplit(),
])
trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=1000,
    special_tokens=special_tokens,
    initial_alphabet=string.ascii_lowercase,
)
tokenizer.train_from_iterator(['EX 01']*100, trainer=trainer)
text = "EX 12"
for offset in tokenizer.encode(text).offsets:
    print(text[offset[0]:offset[1]])
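For reference, with correct offsets this loop would print EX and then 12; given the behavior described above, the <digits> offsets cover only the last digit, so the second line comes out as 2.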
Entirely correct!
I didn't pinpoint the issue yet, but it seems to just output the offsets of the last digit regardless of how many digits there are in the string, so only the last digit gets printed (s/12/123456/g to test).
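A quick check with a longer digit run (a sketch reusing the tokenizer trained above; the comment assumes the last-digit behavior described here):

text = "EX 123456"
for start, end in tokenizer.encode(text).offsets:
    # with correct offsets the second iteration would print 123456;
    # per the diagnosis above, it prints only the last digit, 6
    print(text[start:end])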