
Incorrect offsets after replace with special token

Open david-waterworth opened this issue 3 years ago • 1 comments

I'm not sure whether this is related to https://github.com/huggingface/tokenizers/issues/892. The code below replaces digits with the special <digits> token. The tokens and ids are correct, but the offsets of the <digits> token are off by one. I don't believe this is a general problem with normalizers.Replace: replacing punctuation with " " works, and so does introducing whitespace with a zero-length regex split (e.g. (?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z]) for camel case). The issue only seems to occur with the special token, where the replaced text is longer than one character (I suspect it is probably just related to the latter).

The error seems to be related to the length of the replaced string: if the length is one (e.g. EX1), the offsets are correct, (2,3); if two (e.g. EX12), they are off by one, (3,4) rather than (2,4); if three (e.g. EX123), they are off by two, (4,5) rather than (2,5); and so on.

EDIT: I've confirmed that the issue is with Replace when the matched text is longer than one character: if I replace digits (\d+) with "0" I get the same incorrect offsets, whereas if I replace a single digit (\d) with "0" the offsets are correct.

import tokenizers
import string

UNK = "<unk>"
DIGITS = "<digits>"
special_tokens=[UNK, DIGITS]

tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE(unk_token = UNK))
tokenizer.normalizer = tokenizers.normalizers.Replace(tokenizers.Regex(r"\d+"), DIGITS)
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
    tokenizers.pre_tokenizers.Split(DIGITS, "isolated", False),
    tokenizers.pre_tokenizers.WhitespaceSplit(),
])

trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=1000,
    special_tokens=special_tokens,
    initial_alphabet=list(string.ascii_lowercase),  # pass a list of single characters
)

tokenizer.train_from_iterator(['EX 01']*100, trainer=trainer)

text = "EX 12"
for offset in tokenizer.encode(text).offsets:
    print(text[offset[0]:offset[1]])
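For comparison, here is a minimal plain-Python sketch of the offsets one would expect after such a replacement. It does not use the tokenizers library and is not how tokenizers tracks alignments internally; `replace_with_alignment` and `original_span` are hypothetical helper names introduced only for illustration. The assumption is that every character of the replacement should map back to the full span of the text it replaced.

```python
import re


def replace_with_alignment(text, pattern, replacement):
    """Replace regex matches and record, for every output character,
    the (start, end) span of the original text it came from."""
    normalized = []
    alignments = []
    pos = 0
    for match in re.finditer(pattern, text):
        # Characters before the match are unchanged and map to themselves.
        for i in range(pos, match.start()):
            normalized.append(text[i])
            alignments.append((i, i + 1))
        # Every replacement character maps to the whole matched span.
        for ch in replacement:
            normalized.append(ch)
            alignments.append((match.start(), match.end()))
        pos = match.end()
    # Trailing characters after the last match.
    for i in range(pos, len(text)):
        normalized.append(text[i])
        alignments.append((i, i + 1))
    return "".join(normalized), alignments


def original_span(alignments, start, end):
    """Map the non-empty normalized range [start, end) back to the original text."""
    spans = alignments[start:end]
    return (spans[0][0], spans[-1][1])


for text in ["EX1", "EX12", "EX123", "EX 12"]:
    normalized, alignments = replace_with_alignment(text, r"\d+", "<digits>")
    start = normalized.index("<digits>")
    span = original_span(alignments, start, start + len("<digits>"))
    print(text, "->", normalized, span, repr(text[span[0]:span[1]]))
```

Under this assumption the <digits> token in "EX12" maps back to (2,4) and in "EX 12" to (3,5), i.e. the whole run of digits rather than only the last one.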

david-waterworth, Aug 09 '22 01:08

Entirely correct!

I haven't pinpointed the issue yet, but it seems to output only the offsets of the last digit, regardless of how many digits there are in the string, so only the last digit is printed (s/12/123456/g to test).
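As a quick consistency check, the "last digit only" hypothesis can be compared against the offsets reported above using plain Python. This is only a model of the hypothesis, not the library's code; `expected_span` and `last_char_span` are hypothetical names, and the `reported` values are the ones quoted in the issue description.

```python
import re


def expected_span(text, pattern=r"\d+"):
    """Span of the whole match in the original text."""
    match = re.search(pattern, text)
    return (match.start(), match.end())


def last_char_span(text, pattern=r"\d+"):
    """Span covering only the last character of the match (the hypothesis)."""
    match = re.search(pattern, text)
    return (match.end() - 1, match.end())


# Offsets reported in the issue description for the <digits> token.
reported = {"EX1": (2, 3), "EX12": (3, 4), "EX123": (4, 5)}

for text, observed in reported.items():
    print(
        text,
        "expected:", expected_span(text),
        "reported:", observed,
        "last-char model:", last_char_span(text),
    )
```

The last-character model reproduces all three reported offsets, which is consistent with the observation that only the final digit of the match keeps its alignment.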

Narsil, Aug 23 '22 13:08

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot], Feb 03 '24 01:02