
Issues with offset_mapping values

Open matus-pikuliak opened this issue 3 years ago • 7 comments

Hi guys, I am trying to work with a FairSeq model converted to 🤗 but I have some issues with the tokenizer. I am trying to fine-tune it for POS tagging, so I already have the text split into words, and I want to use the offset_mapping to detect the first token of each word. I do it like this:

tokenizer = RobertaTokenizerFast.from_pretrained('path', add_prefix_space=True)
ids = tokenizer([['drieme', 'drieme']],
    is_split_into_words=True,
    padding=True,
    return_offsets_mapping=True)

The tokenization looks like this:

['<s>', 'Ġd', 'rieme', 'Ġd', 'rieme', '</s>']

But the output from the command looks like this:

{
  'input_ids': [[0, 543, 24209, 543, 24209, 2]],
  'attention_mask': [[1, 1, 1, 1, 1, 1]],
  'offset_mapping': [[(0, 0), (0, 1), (1, 6), (1, 1), (1, 6), (0, 0)]]
}

Notice the offset mapping for the second word. The first word has the mappings (0, 1) and (1, 6), which looks reasonable; the second word, however, has (1, 1) and (1, 6). Suddenly there is a 1 at the first position. This 1 is there for every word except the first one in any sentence I try to parse. I suspect it has something to do with how the start of the sentence is handled differently from the other words, but I am not sure how to solve this so that I get proper offset mappings.
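
(For reference, if the installed transformers version provides BatchEncoding.word_ids(), the word index of every token can be read off directly instead of going through offsets; a minimal sketch reusing the call above, with the expected values only indicative:)

# `ids` is the BatchEncoding returned by the call above
word_ids = ids.word_ids(batch_index=0)        # e.g. [None, 0, 0, 1, 1, None]

# a token starts a new word when its word id differs from the previous token's
first_token_mask = [
    wid is not None and (i == 0 or word_ids[i - 1] != wid)
    for i, wid in enumerate(word_ids)
]
# e.g. [False, True, False, True, False, False]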

matus-pikuliak avatar Apr 10 '21 08:04 matus-pikuliak

I confirm that return_offsets_mapping with is_split_into_words is confusing. It would be beneficial (especially when using max_length and stride) if offset_mapping also included token indexes.

Now it is hard to map subtokens to tokens:

text='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
tokenized_tokens=tokenizer(text.split(), max_length=10, return_overflowing_tokens=True, stride=5, is_split_into_words=True, return_offsets_mapping=True)
tokenized_tokens['offset_mapping']
[[(0, 0), (0, 2), (2, 5), (0, 1), (1, 4), (4, 5), (0, 2), (2, 5), (0, 2), (0, 0)], [(0, 0), (1, 4), (4, 5), (0, 2), (2, 5), (0, 2), (2, 3), (0, 1), (1, 4), (0, 0)], [(0, 0), (2, 5), (0, 2), (2, 3), (0, 1), (1, 4), (4, 5), (0, 3), (3, 5), (0, 0)], [(0, 0), (0, 1), (1, 4), (4, 5), (0, 3), (3, 5), (5, 6), (6, 8), (8, 11), (0, 0)], [(0, 0), (0, 3), (3, 5), (5, 6), (6, 8), (8, 11), (0, 3), (3, 6), (6, 8), (0, 0)], [(0, 0), (6, 8), (8, 11), (0, 3), (3, 6), (6, 8), (8, 10), (0, 4), (4, 5), (0, 0)], [(0, 0), (3, 6), (6, 8), (8, 10), (0, 4), (4, 5), (0, 2), (2, 3), (0, 2), (0, 0)], [(0, 0), (0, 4), (4, 5), (0, 2), (2, 3), (0, 2), (0, 1), (1, 2), (2, 3), (0, 0)], [(0, 0), (2, 3), (0, 2), (0, 1), (1, 2), (2, 3), (3, 6), (6, 7), (0, 3), (0, 0)], [(0, 0), (1, 2), (2, 3), (3, 6), (6, 7), (0, 3), (3, 6), (0, 2), (2, 4), (0, 0)], [(0, 0), (6, 7), (0, 3), (3, 6), (0, 2), (2, 4), (4, 6), (6, 7), (7, 9), (0, 0)], [(0, 0), (0, 2), (2, 4), (4, 6), (6, 7), (7, 9), (9, 10), (0, 2), (0, 2), (0, 0)], [(0, 0), (6, 7), (7, 9), (9, 10), (0, 2), (0, 2), (2, 5), (5, 6), (0, 2), (0, 0)], [(0, 0), (0, 2), (0, 2), (2, 5), (5, 6), (0, 2), (0, 3), (3, 6), (0, 3), (0, 0)], [(0, 0), (5, 6), (0, 2), (0, 3), (3, 6), (0, 3), (3, 5), (0, 3), (3, 5), (0, 0)], [(0, 0), (3, 6), (0, 3), (3, 5), (0, 3), (3, 5), (5, 6), (6, 7), (0, 1), (0, 0)], [(0, 0), (0, 3), (3, 5), (5, 6), (6, 7), (0, 1), (1, 2), (0, 1), (1, 4), (0, 0)], [(0, 0), (6, 7), (0, 1), (1, 2), (0, 1), (1, 4), (0, 2), (0, 4), (4, 5), (0, 0)], [(0, 0), (0, 1), (1, 4), (0, 2), (0, 4), (4, 5), (0, 2), (2, 6), (6, 7), (0, 0)], [(0, 0), (0, 4), (4, 5), (0, 2), (2, 6), (6, 7), (0, 2), (2, 4), (0, 2), (0, 0)], [(0, 0), (2, 6), (6, 7), (0, 2), (2, 4), (0, 2), (2, 6), (6, 7), (0, 2), (0, 0)], [(0, 0), (2, 4), (0, 2), (2, 6), (6, 7), (0, 2), (2, 4), (4, 6), (6, 8), (0, 0)], [(0, 0), (6, 7), (0, 2), (2, 4), (4, 6), (6, 8), (8, 12), (0, 2), (2, 5), (0, 0)], [(0, 0), (4, 6), (6, 8), (8, 12), (0, 2), (2, 5), (5, 7), (0, 2), (2, 5), (0, 0)], [(0, 0), (0, 2), (2, 5), (5, 7), (0, 2), (2, 5), (5, 7), (0, 2), (2, 4), (0, 0)], [(0, 0), (0, 2), (2, 5), (5, 7), (0, 2), (2, 4), (0, 2), (0, 3), (3, 6), (0, 0)], [(0, 0), (0, 2), (2, 4), (0, 2), (0, 3), (3, 6), (6, 7), (0, 2), (0, 1), (0, 0)], [(0, 0), (0, 3), (3, 6), (6, 7), (0, 2), (0, 1), (1, 2), (0, 3), (3, 6), (0, 0)], [(0, 0), (0, 2), (0, 1), (1, 2), (0, 3), (3, 6), (6, 7), (0, 3), (3, 5), (0, 0)], [(0, 0), (0, 3), (3, 6), (6, 7), (0, 3), (3, 5), (5, 7), (7, 9), (9, 10), (0, 0)], [(0, 0), (0, 3), (3, 5), (5, 7), (7, 9), (9, 10), (0, 2), (2, 4), (0, 3), (0, 0)], [(0, 0), (7, 9), (9, 10), (0, 2), (2, 4), (0, 3), (3, 4), (0, 1), (1, 3), (0, 0)], [(0, 0), (2, 4), (0, 3), (3, 4), (0, 1), (1, 3), (3, 5), (0, 2), (2, 5), (0, 0)], [(0, 0), (0, 1), (1, 3), (3, 5), (0, 2), (2, 5), (0, 2), (0, 5), (5, 8), (0, 0)], [(0, 0), (0, 2), (2, 5), (0, 2), (0, 5), (5, 8), (8, 11), (11, 13), (0, 2), (0, 0)], [(0, 0), (0, 5), (5, 8), (8, 11), (11, 13), (0, 2), (0, 2), (2, 4), (4, 7), (0, 0)], [(0, 0), (11, 13), (0, 2), (0, 2), (2, 4), (4, 7), (7, 9), (0, 1), (1, 5), (0, 0)], [(0, 0), (2, 4), (4, 7), (7, 9), (0, 1), (1, 5), (0, 2), (2, 4), (0, 2), (0, 0)], [(0, 0), (0, 1), (1, 5), (0, 2), (2, 4), (0, 2), (2, 3), (3, 6), (0, 3), (0, 0)], [(0, 0), (2, 4), (0, 2), (2, 3), (3, 6), (0, 3), (3, 6), (0, 2), (0, 2), (0, 0)], [(0, 0), (3, 6), (0, 3), (3, 6), (0, 2), (0, 2), (2, 4), (4, 6), (0, 1), (0, 0)], [(0, 0), (0, 2), (0, 2), (2, 4), (4, 6), (0, 1), (1, 3), (3, 5), (0, 3), (0, 0)], [(0, 0), (4, 6), (0, 1), (1, 3), (3, 5), (0, 3), (3, 5), (5, 8), (8, 9), (0, 0)], [(0, 
0), (3, 5), (0, 3), (3, 5), (5, 8), (8, 9), (0, 2), (2, 5), (5, 7), (0, 0)], [(0, 0), (5, 8), (8, 9), (0, 2), (2, 5), (5, 7), (7, 9), (0, 3), (3, 4), (0, 0)], [(0, 0), (2, 5), (5, 7), (7, 9), (0, 3), (3, 4), (0, 2), (2, 4), (4, 5), (0, 0)], [(0, 0), (0, 3), (3, 4), (0, 2), (2, 4), (4, 5), (5, 7), (7, 8), (0, 2), (0, 0)], [(0, 0), (2, 4), (4, 5), (5, 7), (7, 8), (0, 2), (2, 4), (4, 6), (6, 9), (0, 0)], [(0, 0), (7, 8), (0, 2), (2, 4), (4, 6), (6, 9), (0, 3), (0, 3), (3, 7), (0, 0)], [(0, 0), (4, 6), (6, 9), (0, 3), (0, 3), (3, 7), (7, 8), (8, 9), (0, 3), (0, 0)], [(0, 0), (0, 3), (3, 7), (7, 8), (8, 9), (0, 3), (3, 4), (0, 2), (0, 3), (0, 0)], [(0, 0), (8, 9), (0, 3), (3, 4), (0, 2), (0, 3), (3, 5), (0, 2), (2, 3), (0, 0)], [(0, 0), (0, 2), (0, 3), (3, 5), (0, 2), (2, 3), (0, 1), (1, 4), (4, 7), (0, 0)], [(0, 0), (0, 2), (2, 3), (0, 1), (1, 4), (4, 7), (0, 2), (2, 4), (4, 7), (0, 0)], [(0, 0), (1, 4), (4, 7), (0, 2), (2, 4), (4, 7), (7, 8), (0, 3), (3, 6), (0, 0)], [(0, 0), (2, 4), (4, 7), (7, 8), (0, 3), (3, 6), (0, 1), (1, 4), (0, 2), (0, 0)], [(0, 0), (0, 3), (3, 6), (0, 1), (1, 4), (0, 2), (0, 3), (0, 2), (2, 5), (0, 0)], [(0, 0), (1, 4), (0, 2), (0, 3), (0, 2), (2, 5), (5, 7), (7, 8), (0, 0)]]

djstrong avatar Jan 12 '22 16:01 djstrong

@djstrong what is confusing?

You are giving a list of strings (.split()), so offsets are given relative to your already split input. If you want to know which original string a token belongs to, you need to walk the offsets in order and check whenever start < last_stop to know when you are looking at a new token. It is a bit contrived, but the library isn't intended to be used that way, the splitting should be handled automatically by the pre_tokenizer.
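
A minimal sketch of that heuristic (a hypothetical helper, not a library function), checked against the offsets from the first post:

def starts_new_word(offsets):
    """For each (start, stop) pair, return True when it begins a new word,
    using the start < last_stop rule; (0, 0) special tokens return False."""
    flags, last_stop = [], None
    for start, stop in offsets:
        if start == 0 and stop == 0:          # special tokens such as <s> / </s>
            flags.append(False)
            continue
        flags.append(last_stop is None or start < last_stop)
        last_stop = stop
    return flags

# offsets from the first post: [(0, 0), (0, 1), (1, 6), (1, 1), (1, 6), (0, 0)]
# -> [False, True, False, True, False, False]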

What would you expect to receive instead?

Narsil avatar Jan 13 '22 16:01 Narsil

It is trickier when using max_length and stride. Here is a solution for mapping subtokens to tokens:

tokens = text.split()   # the pretokenized words passed to the tokenizer above
stride = 5              # must match the stride passed to the tokenizer

token_index = -1
for offset_mappings, input_ids in zip(tokenized_tokens['offset_mapping'], tokenized_tokens['input_ids']):
    print(offset_mappings)
    tmp = []
    for (start, end), input_id in zip(offset_mappings, input_ids):
        if start == 0 and end == 0:      # special tokens
            continue
        if start == 0:                   # offsets restart at 0 -> first subtoken of a new word
            token_index += 1
        tmp.append(token_index)
        print(start, end, token_index, tokens[token_index][start:end], tokenizer.convert_ids_to_tokens([input_id]))
    # rewind so the strided overlap at the start of the next chunk resolves to the right words
    token_index = tmp[-stride - 1]
    print()

The third column is the token_index into the original (whitespace-split) text:

[(0, 0), (0, 2), (2, 5), (0, 1), (1, 4), (4, 5), (0, 2), (2, 5), (0, 2), (0, 0)]
0 2 0 Lo ['Lo']
2 5 0 rem ['rem</w>']
0 1 1 i ['i']
1 4 1 psu ['psu']
4 5 1 m ['m</w>']
0 2 2 do ['do']
2 5 2 lor ['lor</w>']
0 2 3 si ['si']

[(0, 0), (1, 4), (4, 5), (0, 2), (2, 5), (0, 2), (2, 3), (0, 1), (1, 4), (0, 0)]
1 4 1 psu ['psu']
4 5 1 m ['m</w>']
0 2 2 do ['do']
2 5 2 lor ['lor</w>']
0 2 3 si ['si']
2 3 3 t ['t</w>']
0 1 4 a ['a']
1 4 4 met ['met</w>']
...

My use case is simple token classification for long texts. I need to map subtokens to tokens to assign labels, and I would like to also have the token_index beside start and stop for each subtoken.
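
For reference, a rough sketch of getting that index via word_ids() on the returned encoding, assuming a transformers version that provides it and the strided call from above:

tokens = text.split()   # as in the snippet above
for i in range(len(tokenized_tokens['input_ids'])):
    # word_ids() is available per overflow chunk and, with is_split_into_words=True,
    # indexes directly into the list of input words
    for sub_id, word_id in zip(tokenized_tokens['input_ids'][i],
                               tokenized_tokens.word_ids(batch_index=i)):
        if word_id is None:              # special tokens
            continue
        print(tokenizer.convert_ids_to_tokens([sub_id]), word_id, tokens[word_id])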

It is a bit contrived, but the library isn't intended to be used that way, the splitting should be handled automatically by the pre_tokenizer.

What do you mean? text.split()? Text is usually pretokenized in TokenClassification tasks.

djstrong avatar Jan 13 '22 19:01 djstrong

What do you mean? text.split()? Text is usually pretokenized in TokenClassification tasks.

No, it isn't. For token-classification you can try using the pipeline directly, which should work on any model.

from transformers import pipeline

pipe = pipeline(model="roberta-base")
pipe("Lorem ipsum dolor sit amet, consectetur adipiscing elit")
# [[{'entity': 'LABEL_1',
#  'score': 0.6173451,
#  'index': 1,
#  'word': 'L',
#  'start': 0,
#  'end': 1},  ..........]

It doesn't work on long text with striding right now, but there are open issues for it, and if you're looking to make a contribution there, it would be greatly appreciated. The biggest concern would be conflict resolution when elements of the stride don't agree.

It is trickier when using max_length and stride. Here is a solution for mapping subtokens to tokens:

You can use striding with the tokenizer on its own.

tokenizer("Lorem ipsum", max_length=4, stride=1, return_overflowing_tokens=True, return_offsets_mapping=True)
# {'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
# 'input_ids': [[0, 574, 43375, 2],
#               [0, 43375, 1437, 2],
#               [0, 1437, 7418, 2],
#               [0, 7418, 783, 2]],
# 'offset_mapping': [[(0, 0), (0, 1), (1, 5), (0, 0)],
#                    [(0, 0), (1, 5), (6, 6), (0, 0)],
#                    [(0, 0), (6, 6), (6, 9), (0, 0)],
#                    [(0, 0), (6, 9), (9, 11), (0, 0)]],
# 'overflow_to_sample_mapping': [0, 0, 0, 0]}

As you can see, all offsets are linked to the original string; you don't have to send pretokenized input. This is the intended way. (That is why I mentioned the .split() was contrived.)
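
If you then still need word indices, you can recover them from the character offsets against the original string; a rough sketch (the regex split here is just an assumption about what counts as a word):

import re

text = "Lorem ipsum"                                   # the original, unsplit string
word_spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def word_index(start, end):
    """Return the index of the whitespace-delimited word containing this token span,
    or None for zero-width spans (special tokens, bare space tokens)."""
    if start == end:
        return None
    for i, (ws, we) in enumerate(word_spans):
        if ws <= start and end <= we:
            return i
    return None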

My use case is simple token classification for long texts. I need to map subtokens to tokens to assign labels, and I would like to also have the token_index beside start and stop for each subtoken.

Sorry to insist, but I must be frank: there are no subtokens vs tokens. Something is a token or it is not. A token is the integer number (usually called an id) associated with some portion of the original text (which can be zero-width, and portions can overlap). It is, by definition, not cuttable into new pieces.

What you are probably thinking about is "words" vs "tokens", with "tokens" being pieces of "words". This is a misconception. First of all, not all tokenizers even have the concept of a "word"; some treat the incoming string as a whole and chunk it without ever looking at "words". Then, "word" is actually quite ill-defined when examined with sufficient scrutiny (as is any concept, really, but I digress). What people call "words" are usually whitespace-separated portions of text, and that is just a way to "pretokenize" your text, basically to help the tokenizer find boundaries which make sense for your language. There are many languages where whitespace isn't used; Chinese is a great example. Even in English whitespace is not necessarily enough ("new-york", "hello!"), and German can concatenate "words" to make new ones.

So whitespace splitting is just a way to force tokenizers to limit their tokens in such a way that they always stop at a space boundary. This is a bias injected into the tokenizer (and a requirement for speed for BPE, for instance), and we internally call this bias a "pre_tokenizer" (since it usually works by creating artificial boundaries, hence pre-cutting the original text into pieces that aren't tokens yet).
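
For instance, the built-in pre-tokenizers can be inspected directly to see the boundaries they create (the output shown is only approximate):

from tokenizers import pre_tokenizers

pre = pre_tokenizers.Sequence([pre_tokenizers.Whitespace(), pre_tokenizers.Punctuation()])
pre.pre_tokenize_str("new-york, hello!")
# roughly: [('new', (0, 3)), ('-', (3, 4)), ('york', (4, 8)), (',', (8, 9)),
#           ('hello', (10, 15)), ('!', (15, 16))]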

Sorry for the rant, and I don't mean to be rude; this misconception is extremely common, so I try to clarify it at every opportunity I get.

Narsil avatar Jan 20 '22 09:01 Narsil

I know the difference, and it is IMHO not a misconception but different definitions of words/tokens/subtokens. Whitespace splitting was just an easy way to create a pretokenized example. So for me, tokens are fragments of text (usually words and punctuation marks), and those tokens can be further tokenized into subtokens (integer numbers). In token classification we usually provide pretokenized texts, especially during training. I don't know why you disagree with that. Look at the datasets (most of them are pretokenized): https://huggingface.co/datasets/conll2003

So I have to, or at least want to, provide pretokenized text.

djstrong avatar Jan 20 '22 12:01 djstrong

And can you do the following?

all_ids = []
offsets = []
for (token_id, token) in enumerate(sentence.split()):
    # encode each word on its own, without special tokens, and ask for offsets
    # (assumes a fast tokenizer from transformers; with the tokenizers library you would
    #  use encoded = tokenizer.encode(token) and read encoded.ids / encoded.offsets)
    encoded = tokenizer(token, add_special_tokens=False, return_offsets_mapping=True)
    all_ids.append(encoded["input_ids"])
    for (start, stop) in encoded["offset_mapping"]:
        offsets.append((token_id, start, stop))

I know the difference, and it is IMHO not a misconception but different definitions of words/tokens/subtokens. Whitespace splitting was just an easy way to create a pretokenized example. So for me, tokens are fragments of text (usually words and punctuation marks), and those tokens can be further tokenized into subtokens (integer numbers).

OK, in this library's vocabulary, what you call a token would then be a sentence (long or not, both are seen as a single string), and what you call a subtoken would be a token.

In token classification we usually provide pretokenized texts, especially during training. I don't know why you disagree with that. Look at the datasets (most of them are pretokenized): https://huggingface.co/datasets/conll2003

I know, but I maintain that these are pretty powerful biases being introduced. Anyway, from this library's perspective, every bias should be introduced through a pre_tokenizer, if you can, by using something like tokenizer.pre_tokenizer = pre_tokenizers.Sequence([pre_tokenizers.Whitespace(), pre_tokenizers.Punctuation()]) and then using the tokenizer directly on the full sentence. There's also a custom pre-tokenizer that could be used:

from tokenizers.pre_tokenizers import PreTokenizer

class CamelCasePretok:
    def get_state(self, c):
        if c.islower():
            return "lower"
        elif c.isupper():
            return "upper"
        elif c.isdigit():
            return "digit"
        else:
            return "rest"

    def split(self, n, normalized):
        # split a NormalizedString into pieces at camel-case boundaries
        i = 0
        # states = {"any", "lower", "upper", "digit", "rest"}
        state = "any"
        pieces = []
        for j, c in enumerate(normalized.normalized):
            c_state = self.get_state(c)
            if state == "any":
                state = c_state
            if state != "rest" and state == c_state:
                pass                               # same character class: keep extending the piece
            elif state == "upper" and c_state == "lower":
                pass                               # upper followed by lower ('T' then 'h' in "This"): same piece
            else:
                pieces.append(normalized[i:j])     # boundary: emit the piece accumulated so far
                i = j
            state = c_state
        pieces.append(normalized[i:])
        return pieces

    def pre_tokenize(self, pretok):
        pretok.split(self.split)

tokenizer.pre_tokenizer = PreTokenizer.custom(CamelCasePretok())
# Using the tokenizer on "ThisIsATest" should yield something like ["This", "Is", "A", "Test"] (with maybe further splitting)

Please note that custom pre-tokenizers cannot be saved in the .json file, so they need to be reapplied manually on each load.
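
Something along these lines after every load (the file name here is hypothetical, and before saving you would swap back to a serializable pre-tokenizer):

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace

tok = Tokenizer.from_file("tokenizer.json")            # hypothetical path
tok.pre_tokenizer = PreTokenizer.custom(CamelCasePretok())   # reattach after every load

# before saving, swap back to something serializable, e.g.:
# tok.pre_tokenizer = Whitespace()
# tok.save("tokenizer.json")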

Narsil avatar Jan 20 '22 14:01 Narsil

Thank you. I can tokenize each of "my tokens" separately, but one call to the tokenizer should be faster, and I would have to implement max_length and stride myself. This is the way I am doing it right now, but I thought the tokenizer would do that for me faster.

djstrong avatar Jan 20 '22 16:01 djstrong

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 06 '24 01:04 github-actions[bot]