DebertaV2TokenizerFast and XLMRobertaTokenizerFast have overlapping offsets/CharSpans, which leads to char_to_token() pointing to an unexpected token
Noticed this when working with transformers.DebertaV2TokenizerFast and XLMRobertaTokenizerFast.
My transformers and tokenizers versions
import transformers, tokenizers
print(f'{transformers.__version__=}')
print(f'{tokenizers.__version__=}')
transformers.__version__='4.51.3'
tokenizers.__version__='0.21.1'
Repro in Python
from transformers import AutoTokenizer

samples = ['English language', '中文 中文 中文', '日本語 日本語 日本語', 'русский язык', '한국어 한국어 한국어', 'français français français', '中 文 中 文']
tokenizer = AutoTokenizer.from_pretrained('microsoft/mdeberta-v3-base')
print(f'{type(tokenizer)=}')
encodings = tokenizer(samples)
for i in range(len(samples)):
    print(f'{i=}: {samples[i]}')
    tokens = encodings.tokens(i)
    print(f'{tokens=}')
    print(f'{encodings._encodings[i].offsets=}')
    chars = []
    for j in range(len(tokens)):
        charspan = encodings.token_to_chars(i, j)
        if charspan is not None:
            chars.append((j, encodings['input_ids'][i][j], charspan, samples[i][charspan.start:charspan.end]))
        else:
            chars.append((j, encodings['input_ids'][i][j], charspan, None))
    print(f'{chars=}')
    print(f'{encodings.char_to_token(i, 0)=}')
    print(f'{encodings.token_to_chars(i, encodings.char_to_token(i, 0))=}')
    print(f'{encodings["input_ids"][i][encodings.char_to_token(i, 0)]=}')
    print()
outputs:
type(tokenizer)=<class 'transformers.models.deberta_v2.tokenization_deberta_v2_fast.DebertaV2TokenizerFast'>
i=0: English language
tokens=['[CLS]', '▁English', '▁language', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 7), (7, 16), (0, 0)]
chars=[(0, 1, None, None), (1, 5414, CharSpan(start=0, end=7), 'English'), (2, 17897, CharSpan(start=7, end=16), ' language'), (3, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=7)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=5414
i=1: 中文 中文 中文
tokens=['[CLS]', '▁', '中文', '▁', '中文', '▁', '中文', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 2), (2, 3), (3, 5), (5, 6), (6, 8), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), '中'), (2, 18885, CharSpan(start=0, end=2), '中文'), (3, 260, CharSpan(start=2, end=3), ' '), (4, 18885, CharSpan(start=3, end=5), '中文'), (5, 260, CharSpan(start=5, end=6), ' '), (6, 18885, CharSpan(start=6, end=8), '中文'), (7, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260
i=2: 日本語 日本語 日本語
tokens=['[CLS]', '▁', '日本語', '▁', '日本語', '▁', '日本語', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 3), (3, 4), (4, 7), (7, 8), (8, 11), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), '日'), (2, 30906, CharSpan(start=0, end=3), '日本語'), (3, 260, CharSpan(start=3, end=4), ' '), (4, 30906, CharSpan(start=4, end=7), '日本語'), (5, 260, CharSpan(start=7, end=8), ' '), (6, 30906, CharSpan(start=8, end=11), '日本語'), (7, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260
i=3: русский язык
tokens=['[CLS]', '▁', 'русский', '▁язык', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 7), (7, 12), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), 'р'), (2, 86154, CharSpan(start=0, end=7), 'русский'), (3, 11184, CharSpan(start=7, end=12), ' язык'), (4, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260
i=4: 한국어 한국어 한국어
tokens=['[CLS]', '▁', '한국어', '▁', '한국어', '▁', '한국어', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 3), (3, 4), (4, 7), (7, 8), (8, 11), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), '한'), (2, 61330, CharSpan(start=0, end=3), '한국어'), (3, 260, CharSpan(start=3, end=4), ' '), (4, 61330, CharSpan(start=4, end=7), '한국어'), (5, 260, CharSpan(start=7, end=8), ' '), (6, 61330, CharSpan(start=8, end=11), '한국어'), (7, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260
i=5: français français français
tokens=['[CLS]', '▁français', '▁français', '▁français', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 8), (8, 17), (17, 26), (0, 0)]
chars=[(0, 1, None, None), (1, 30326, CharSpan(start=0, end=8), 'français'), (2, 30326, CharSpan(start=8, end=17), ' français'), (3, 30326, CharSpan(start=17, end=26), ' français'), (4, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=8)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=30326
i=6: 中 文 中 文
tokens=['[CLS]', '▁', '中', '▁', '文', '▁', '中', '▁', '文', '[SEP]']
encodings._encodings[i].offsets=[(0, 0), (0, 1), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (0, 0)]
chars=[(0, 1, None, None), (1, 260, CharSpan(start=0, end=1), '中'), (2, 1224, CharSpan(start=0, end=1), '中'), (3, 260, CharSpan(start=1, end=2), ' '), (4, 4566, CharSpan(start=2, end=3), '文'), (5, 260, CharSpan(start=3, end=4), ' '), (6, 1224, CharSpan(start=4, end=5), '中'), (7, 260, CharSpan(start=5, end=6), ' '), (8, 4566, CharSpan(start=6, end=7), '文'), (9, 2, None, None)]
encodings.char_to_token(i, 0)=1
encodings.token_to_chars(i, encodings.char_to_token(i, 0))=CharSpan(start=0, end=1)
encodings["input_ids"][i][encodings.char_to_token(i, 0)]=260
What appears strange
With English and French (i=0 or i=5), the leading '▁' is part of the first non-special token, e.g. '▁English', and the offsets/CharSpans (other than the special (0, 0)/None ones) do not overlap -- e.g. [(0, 0), (0, 7), (7, 16), (0, 0)]. However, when Chinese/Japanese/Korean/Russian is involved, the first non-special token is a standalone '▁', and there are overlapping offsets/CharSpans, such as (0, 1) and (0, 2) in the i=1 (Chinese) example.
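A quick way to confirm this pattern programmatically (a minimal sketch reusing samples and encodings from the repro above; the overlap check itself is my own illustration, not a transformers API):

for i, text in enumerate(samples):
    offsets = encodings._encodings[i].offsets
    special = encodings._encodings[i].special_tokens_mask
    # keep only non-special token spans, paired with their token index
    spans = [(j, off) for j, off in enumerate(offsets) if not special[j]]
    for (j1, (s1, e1)), (j2, (s2, e2)) in zip(spans, spans[1:]):
        if s2 < e1:  # next span starts before the current one ends
            print(f'{i=} {text!r}: token {j1} {(s1, e1)} overlaps token {j2} {(s2, e2)}')

For i=1 this flags token 1 at (0, 1) overlapping token 2 at (0, 2), while i=0 and i=5 report nothing.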
Why it matters
Suppose I want to align information extracted with a deep tokenizer like DebertaV2TokenizerFast against information associated with a naive split-by-space tokenization. Using i=1: text='中文 中文 中文' as an example: for the first "naive token" '中文' at CharSpan(start=0, end=2), to look up its DeBERTa token I'd call encodings.char_to_token(i, 0), where 0 is the start of the CharSpan, and get token index 1, which corresponds to the standalone '▁' token with input_id=260. However, that token does not represent '中文' at CharSpan(start=0, end=2) at all!
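For concreteness, a minimal sketch of that alignment attempt (reuses the tokenizer from the repro; naive_spans is a hypothetical helper, not part of transformers):

text = '中文 中文 中文'
enc = tokenizer(text)  # single-sequence encoding

def naive_spans(s):
    # character spans of a split-by-space "naive" tokenization
    spans, start = [], 0
    for word in s.split(' '):
        spans.append((start, start + len(word)))
        start += len(word) + 1
    return spans

for start, end in naive_spans(text):
    itoken = enc.char_to_token(start)
    print((start, end), text[start:end], '->', itoken, enc.tokens()[itoken], enc.token_to_chars(itoken))
# the first naive token '中文' at (0, 2) maps to token index 1, the standalone '▁'
# (input_id 260) with CharSpan(start=0, end=1), not the '中文' piece at (0, 2)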
My questions
- Is it expected behavior that non-special offsets/CharSpans could overlap?
- Why does this only happen for some languages? I guess non-ASCII alphabets are more likely to be affected?
- Is this a plausible workaround?
itoken = encodings.char_to_token(0)
if encodings.token_to_chars(itoken).start == encodings.token_to_chars(itoken + 1).start:
    itoken += 1
input_id = encodings['input_ids'][itoken]
...
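For reference, the same heuristic wrapped into a helper and applied to the i=1 example (a sketch only; char_to_token_skip_meta is a hypothetical name, it assumes a single-sequence encoding and relies on token_to_chars returning None for special tokens, as in the outputs above):

def char_to_token_skip_meta(enc, char_index):
    itoken = enc.char_to_token(char_index)
    if itoken is None:
        return None
    span = enc.token_to_chars(itoken)
    next_span = enc.token_to_chars(itoken + 1)
    # if the next token starts at the same character, the current token is the
    # standalone '▁' whose span overlaps the real piece -- skip it
    if next_span is not None and next_span.start == span.start:
        itoken += 1
    return itoken

enc = tokenizer('中文 中文 中文')
itoken = char_to_token_skip_meta(enc, 0)
print(itoken, enc.tokens()[itoken], enc['input_ids'][itoken])  # 2 '中文' 18885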
Hey! Sorry for the late answer 🤗 I never really spent a lot of time with these tokenizers, but yeah, non-European languages are usually a bit tricky! If this is a bug in the way we compute / encode, I'm happy to fix it, but I don't think it is!