
DebertaV2TokenizerFast and XLMRobertaTokenizerFast have overlapping offsets/CharSpans, which leads to char_to_token() pointing to an unexpected token

Open ligz08 opened this issue 8 months ago • 1 comment

Noticed this when working with transformers.DebertaV2TokenizerFast and XLMRobertaTokenizerFast.

My transformers and tokenizers versions

import transformers, tokenizers
print(f'{transformers.__version__=}')
print(f'{tokenizers.__version__=}')
transformers.__version__='4.51.3'
tokenizers.__version__='0.21.1'

Repro in Python

from transformers import AutoTokenizer

samples = ['English language', '中文 中文 中文', '日本語 日本語 日本語', 'русский язык', '한국어 한국어 한국어', 'français français français', '中 文 中 文']
tokenizer = AutoTokenizer.from_pretrained('microsoft/mdeberta-v3-base')
print(f'{type(tokenizer)=}')
encodings = tokenizer(samples)
for i in range(len(samples)):
    print(f'{i=}: {samples[i]}')
    tokens = encodings.tokens(i)
    print(f'{tokens=}')
    print(f'{encodings._encodings[i].offsets=}')
    chars = []
    for j in range(len(tokens)):
        charspan = encodings.token_to_chars(i, j)
        if charspan is not None:
            chars.append((j, encodings['input_ids'][i][j], charspan, samples[i][charspan.start:charspan.end]))
        else:
            chars.append((j, encodings['input_ids'][i][j], charspan, None))
    print(f'{chars=}')
    print(f'{encodings.char_to_token(i, 0)=}')
    print(f'{encodings.token_to_chars(i, encodings.char_to_token(i, 0))=}')
    print(f'{encodings["input_ids"][i][encodings.char_to_token(i, 0)]=}')
    print()

outputs:

type(tokenizer)=<class 'transformers.models.deberta_v2.tokenization_deberta_v2_fast.DebertaV2TokenizerFast'>
i=0: English language
tokens=['[CLS]', '▁English', '▁language', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 7), (7, 16), (0, 0)]
chars=[(0, 1, None, None), (1, 5414, CharSpan(start=0, end=7), 'English'), (2, 17897, CharSpan(start=7, end=16), ' language'), (3, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=7)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=5414

i=1: 中文 中文 中文
tokens=['[CLS]', '▁', '中文', '▁', '中文', '▁', '中文', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 2), (2, 3), (3, 5), (5, 6), (6, 8), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), '中'), (2, 18885, CharSpan(start=0, end=2), '中文'), (3, 260, CharSpan(start=2, end=3), ' '), (4, 18885, CharSpan(start=3, end=5), '中文'), (5, 260, CharSpan(start=5, end=6), ' '), (6, 18885, CharSpan(start=6, end=8), '中文'), (7, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260

i=2: 日本語 日本語 日本語
tokens=['[CLS]', '▁', '日本語', '▁', '日本語', '▁', '日本語', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 3), (3, 4), (4, 7), (7, 8), (8, 11), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), '日'), (2, 30906, CharSpan(start=0, end=3), '日本語'), (3, 260, CharSpan(start=3, end=4), ' '), (4, 30906, CharSpan(start=4, end=7), '日本語'), (5, 260, CharSpan(start=7, end=8), ' '), (6, 30906, CharSpan(start=8, end=11), '日本語'), (7, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260

i=3: русский язык
tokens=['[CLS]', '▁', 'русский', '▁язык', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 7), (7, 12), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), 'р'), (2, 86154, CharSpan(start=0, end=7), 'русский'), (3, 11184, CharSpan(start=7, end=12), ' язык'), (4, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260

i=4: 한국어 한국어 한국어
tokens=['[CLS]', '▁', '한국어', '▁', '한국어', '▁', '한국어', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 3), (3, 4), (4, 7), (7, 8), (8, 11), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), '한'), (2, 61330, CharSpan(start=0, end=3), '한국어'), (3, 260, CharSpan(start=3, end=4), ' '), (4, 61330, CharSpan(start=4, end=7), '한국어'), (5, 260, CharSpan(start=7, end=8), ' '), (6, 61330, CharSpan(start=8, end=11), '한국어'), (7, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260

i=5: français français français
tokens=['[CLS]', '▁français', '▁français', '▁français', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 8), (8, 17), (17, 26), (0, 0)]
chars=[(0, 1, None, None), (1, 30326, CharSpan(start=0, end=8), 'français'), (2, 30326, CharSpan(start=8, end=17), ' français'), (3, 30326, CharSpan(start=17, end=26), ' français'), (4, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=8)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=30326

i=6: 中 文 中 文
tokens=['[CLS]', '▁', '中', '▁', '文', '▁', '中', '▁', '文', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), '中'), (2, 1224, CharSpan(start=0, end=1), '中'), (3, 260, CharSpan(start=1, end=2), ' '), (4, 4566, CharSpan(start=2, end=3), '文'), (5, 260, CharSpan(start=3, end=4), ' '), (6, 1224, CharSpan(start=4, end=5), '中'), (7, 260, CharSpan(start=5, end=6), ' '), (8, 4566, CharSpan(start=6, end=7), '文'), (9, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260

What appears strange

With English and French (i=0 or i=5), the leading '▁' is part of the first non-special token, e.g. '▁English', and the offsets/CharSpans (other than the special (0, 0)/None ones) do not overlap, e.g. [(0, 0), (0, 7), (7, 16), (0, 0)]. However, when Chinese/Japanese/Korean/Russian is involved, the first non-special token is a standalone '▁', and some offsets/CharSpans overlap, like (0, 1) and (0, 2) in the i=1 (Chinese) example. A small check that detects such overlaps is sketched below.
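A minimal sketch (not part of the original repro; it assumes the same mdeberta-v3-base checkpoint and uses the public return_offsets_mapping option instead of the private _encodings attribute) that flags any non-special span starting before the previous span has ended:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/mdeberta-v3-base')
enc = tokenizer('中文 中文 中文', return_offsets_mapping=True)

prev_end = 0
for j, (start, end) in enumerate(enc['offset_mapping']):
    if start == end:  # special tokens like [CLS]/[SEP] report (0, 0)
        continue
    if start < prev_end:
        print(f'overlap at token index {j}: span ({start}, {end}) starts before {prev_end}')
    prev_end = max(prev_end, end)

With the offsets shown above for i=1, this reports an overlap at token index 2, whose span (0, 2) starts before the previous span ending at 1.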

Why it matters

Suppose I want to align information extracted from a deep tokenizer like DebertaV2TokenizerFast with information associated with a naive split-by-space tokenization. Using i=1, text='中文 中文 中文', as an example: to look up the DeBERTa token for the first "naive token" '中文' at CharSpan(start=0, end=2), I would call encodings.char_to_token(i, 0), where 0 is the start of the CharSpan, and get token index 1, which corresponds to the '▁' token with input_id=260. However, that token does not represent '中文' at CharSpan(start=0, end=2) at all! The sketch below spells out this alignment attempt.
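A sketch of that alignment, assuming the same samples and encodings objects from the repro (re.finditer over non-space runs is just a stand-in for the naive split-by-space tokenization):

import re

i = 1  # '中文 中文 中文'
for m in re.finditer(r'\S+', samples[i]):
    start, end = m.span()  # the naive token's character span
    j = encodings.char_to_token(i, start)
    print(m.group(), (start, end), '->', encodings.tokens(i)[j], encodings.token_to_chars(i, j))

For the first naive token this prints the standalone '▁' with CharSpan(start=0, end=1), as in the output above, rather than the '中文' token that actually covers characters 0 to 2.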

My questions

  • Is it expected behavior that non-special offsets/CharSpans could overlap?
  • Why does this only happen for some languages? I guess non-ASCII alphabets are more likely to be affected?
  • Is this a plausible workaround?
itoken = encodings.char_to_token(i, 0)
this_span = encodings.token_to_chars(i, itoken)
next_span = encodings.token_to_chars(i, itoken + 1)  # may be None for special tokens
if next_span is not None and this_span.start == next_span.start:
    itoken += 1
input_id = encodings['input_ids'][i][itoken]
...
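An alternative I have been considering (just a sketch, not an official API; the helper name char_to_widest_token is mine, and it reuses the same private _encodings attribute as the repro) is to resolve the character against the offsets directly and prefer the widest non-special span that covers it:

def char_to_widest_token(encodings, batch_index, char_index):
    """Index of the widest non-special token whose span covers char_index, or None."""
    offsets = encodings._encodings[batch_index].offsets
    best = None
    for j, (start, end) in enumerate(offsets):
        if start == end:  # skip special tokens, which report (0, 0)
            continue
        if start <= char_index < end:
            if best is None or (end - start) > (offsets[best][1] - offsets[best][0]):
                best = j
    return best

# For i=1 ('中文 中文 中文') this returns 2, i.e. the '中文' token (input_id 18885),
# instead of the standalone '▁' that char_to_token(i, 0) points to.
print(char_to_widest_token(encodings, 1, 0))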

ligz08 · May 06 '25 07:05

Hey! Sorry for the late answer 🤗 I've never really spent a lot of time with these tokenizers, but yeah, non-European languages are usually a bit tricky! If this is a bug in the way we compute/encode the offsets I'm happy to fix it, but I don't think it is!

ArthurZucker · Jul 29 '25 13:07