XLM-Roberta offset mapping is off by one in case of whitespace-subwords
If a sentence is tokenized with the XLM-Roberta fast tokenizer, the offset mapping is off by one if one of the subwords is only a space. Example:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large', use_fast=True)

tokenizer.tokenize('Quality of work is sufficient')
# ['▁Quality', '▁of', '▁work', '▁is', '▁', 'sufficient']

tokenizer.encode_plus('Quality of work is sufficient', return_offsets_mapping=True)
# {'input_ids': [0, 124604, 111, 4488, 83, 6, 129980, 2],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
#  'offset_mapping': [(0, 0), (0, 7), (8, 10), (11, 15), (16, 18), (19, 20), (19, 29), (0, 0)]}
```
The third-last offset tuple (19, 20) overlaps with the second-last offset tuple (19, 29). I believe it should be (18, 19), so that it refers to the whitespace.
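For reference, slicing the original string with those offsets shows the mismatch:

```python
text = 'Quality of work is sufficient'
print(text[19:20])  # 's'  -- the reported span points into "sufficient"
print(text[18:19])  # ' '  -- the span the whitespace-only subword should map to
print(text[19:29])  # 'sufficient'
```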
Note that this issue was first raised in the transformers library (https://github.com/huggingface/transformers/issues/17454)
Hi @robvanderg ,
This is, so to speak, normal given how the tokenizer was configured (which we can debate, of course).
This tokenizer uses a whitespace split which eats up the spaces (so they are no longer visible to the tokenizer).
It then uses Metaspace, which re-adds the extra _ to all the space-split pieces.
The tokenizer then sees "_sufficient" as one item (remember, _ is not the original whitespace but the extra added _, so it does not come from the original string).
It tries to tokenize it, and the Unigram algorithm decides to use the tokens "_" + "sufficient" (I am guessing the token "_sufficient" doesn't exist, and other combinations like "_suff" + "icient" have a higher cost than this one).
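For illustration, here is a minimal sketch of the whitespace-split + Metaspace pre-tokenization described above, using the tokenizers pre-tokenizer API directly (the exact configuration shipped with xlm-roberta-large may differ):

```python
from tokenizers.pre_tokenizers import Metaspace, Sequence, WhitespaceSplit

# Whitespace splitting eats the original spaces; Metaspace then prepends the
# extra "▁" to each resulting piece (that "▁" is not part of the original string).
pre_tok = Sequence([WhitespaceSplit(), Metaspace(replacement="▁")])
print(pre_tok.pre_tokenize_str("Quality of work is sufficient"))
# pieces such as ('▁Quality', ...), ('▁of', ...), ..., ('▁sufficient', ...);
# the Unigram model later splits '▁sufficient' into '▁' + 'sufficient'
```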
Now, the offsets should technically be (19, 19) (zero width), (19, 29) for this particular config.
I can't remember this particular bug at this point in time, but basically it was tricky to change for some reason. I think it's because Metaspace doesn't track whether it added the prefix space or not. So the bug only shows up when the token "_" is used on its own, and not in most other cases (like "_Quality", for instance).
It's definitely a bug. I already tracked it down a few months ago, and it was tricky to fix without touching some other functioning component (and I have limited time; even pushing a new version is delayed because I mostly work on tokenizers in my spare time).
Another note: maybe this tokenizer should be configured differently and NOT eat up all the spaces, but if it was done this way there must have been reasons related to how the original implementation works. One common thing I know is that some implementations remove all duplicate spaces, which is tricky to do currently with tokenizers (because we need to keep track of offsets, which most implementations don't, and we also try to take into consideration odd utf-8 spaces and things like that).
@SaulLu @ydshieh for visibility
Hi @Narsil ,
Thanks for the detailed reply! I understand that it is a tricky case to fix. Would it be safe to assume that any overlapping tuples, where the first is of length 1, are similar cases?
ps. I have already implemented an (inefficient) alignment from the original text to the subwords (which was also non-trivial) that seems to work, so for my use case this can be considered resolved
> Thanks for the detailed reply! I understand that it is a tricky case to fix. Would it be safe to assume that any overlapping tuples, where the first is of length 1, are similar cases?
Actually no. Under Unicode normalization rules, you can also have the same span of text that generates multiple tokens (in my experience it's rare, but I have seen it).
Let's imagine there's a 4-byte-wide utf-8 char for an accented character that gets renormalized under NFKC to a 2-byte-wide accent char + a 1-byte-wide letter char, and we have a byte-level tokenizer that happens to miss the combination but has the individual chars.
Then the result will be 1 token for the accent char + 1 token for the letter, both corresponding to the same 4-byte-wide original char, so the span will be the same for both of them.
Relatively exotic but definitely possible.
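As an added illustration of the same phenomenon (using Python's unicodedata rather than the tokenizer's own normalizer, and a different character than the accent example above), normalization can expand one original character into several:

```python
import unicodedata

# One original character can normalize to several characters; if each of them
# ends up in its own token, all of those tokens map back to the same original span.
s = "…"                                                # U+2026, a single character
print(unicodedata.normalize("NFKC", s))                # '...'
print(len(s), len(unicodedata.normalize("NFKC", s)))   # 1 3
```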
The cleanest solution would definitely be to fix the offsets to actually get a zero-width span in that case, but it's likely to take time.
Otherwise I would check whether original_string[start-1:stop-1] == " ". (This only applies to tokenizers that behave this way.)
It's definitely a workaround, but this heuristic seems like it should work (take care of the start of the string). Regardless of the heuristic, I would run it on a big utf-8-heavy dataset to encounter as many caveats as early as possible (MNLI was helpful in detecting many utf-8 oddities in the past).
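A minimal sketch of that heuristic (the helper name fix_whitespace_offsets is hypothetical, for illustration only): shift a one-character span back by one position when it overlaps the start of the next span and the preceding character in the original text is a space.

```python
def fix_whitespace_offsets(text, offsets):
    # Hypothetical workaround, not a library function: move a 1-character span
    # back by one position when it overlaps the next span's start and the
    # preceding character in the original text is a space (guarding position 0).
    fixed = list(offsets)
    for i in range(len(fixed) - 1):
        (start, stop), (next_start, _) = fixed[i], fixed[i + 1]
        if (stop - start == 1 and start == next_start
                and start > 0 and text[start - 1] == " "):
            fixed[i] = (start - 1, stop - 1)
    return fixed

text = "Quality of work is sufficient"
offsets = [(0, 0), (0, 7), (8, 10), (11, 15), (16, 18), (19, 20), (19, 29), (0, 0)]
print(fix_whitespace_offsets(text, offsets))
# (19, 20) becomes (18, 19), i.e. the whitespace before "sufficient"
```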
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.