tokenizers
Issues with offset_mapping values
Hi guys, I am trying to work with a FairSeq model converted to 🤗, but I have some issues with the tokenizer. I am trying to fine-tune it for POS tagging, so I have the text already split into words and I want to use the offset_mapping to detect the first token of each word. I do it like this:
tokenizer = RobertaTokenizerFast.from_pretrained('path', add_prefix_space=True)
ids = tokenizer([['drieme', 'drieme']],
                is_split_into_words=True,
                padding=True,
                return_offsets_mapping=True)
The tokenization looks like this:
['<s>', 'Ġd', 'rieme', 'Ġd', 'rieme', '</s>']
But the output from the command looks like this:
{
'input_ids': [[0, 543, 24209, 543, 24209, 2]],
'attention_mask': [[1, 1, 1, 1, 1, 1]],
'offset_mapping': [[(0, 0), (0, 1), (1, 6), (1, 1), (1, 6), (0, 0)]]
}
Notice the offset mapping for the second word. The first word has mappings (0, 1) and (1, 6), which looks reasonable; however, the second word has (1, 1) and (1, 6). Suddenly there is a 1 in the first position, and this 1 appears for every word except the first in any sentence I try to parse. I suspect it has something to do with the start of the sentence being handled differently from the other words, but I am not sure how to solve this so that I get proper offset mappings.
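For reference, fast tokenizers also expose word_ids() on the encoding, which gives a word index per token directly and sidesteps the offset quirk. A minimal sketch of using it to mark the first token of each word (the word_ids list below is what the example encoding would plausibly return; I have not verified it against this exact model):

```python
# word_ids() on a fast-tokenizer encoding yields one word index per token
# (None for special tokens). Marking the first subword of each word for
# POS tagging then needs no offsets at all.
def first_token_mask(word_ids):
    mask, previous = [], None
    for wid in word_ids:
        mask.append(wid is not None and wid != previous)
        previous = wid
    return mask

# For the encoding above, word_ids() would look like [None, 0, 0, 1, 1, None]:
print(first_token_mask([None, 0, 0, 1, 1, None]))
# → [False, True, False, True, False, False]
```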
I confirm that return_offsets_mapping with is_split_into_words is confusing. It would be beneficial (especially when using max_length and stride) if offset_mapping also included token indexes. As it stands, it is hard to map subtokens to tokens:
text='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
tokenized_tokens=tokenizer(text.split(), max_length=10, return_overflowing_tokens=True, stride=5, is_split_into_words=True, return_offsets_mapping=True)
tokenized_tokens['offset_mapping']
[[(0, 0), (0, 2), (2, 5), (0, 1), (1, 4), (4, 5), (0, 2), (2, 5), (0, 2), (0, 0)], [(0, 0), (1, 4), (4, 5), (0, 2), (2, 5), (0, 2), (2, 3), (0, 1), (1, 4), (0, 0)], [(0, 0), (2, 5), (0, 2), (2, 3), (0, 1), (1, 4), (4, 5), (0, 3), (3, 5), (0, 0)], [(0, 0), (0, 1), (1, 4), (4, 5), (0, 3), (3, 5), (5, 6), (6, 8), (8, 11), (0, 0)], [(0, 0), (0, 3), (3, 5), (5, 6), (6, 8), (8, 11), (0, 3), (3, 6), (6, 8), (0, 0)], [(0, 0), (6, 8), (8, 11), (0, 3), (3, 6), (6, 8), (8, 10), (0, 4), (4, 5), (0, 0)], [(0, 0), (3, 6), (6, 8), (8, 10), (0, 4), (4, 5), (0, 2), (2, 3), (0, 2), (0, 0)], [(0, 0), (0, 4), (4, 5), (0, 2), (2, 3), (0, 2), (0, 1), (1, 2), (2, 3), (0, 0)], [(0, 0), (2, 3), (0, 2), (0, 1), (1, 2), (2, 3), (3, 6), (6, 7), (0, 3), (0, 0)], [(0, 0), (1, 2), (2, 3), (3, 6), (6, 7), (0, 3), (3, 6), (0, 2), (2, 4), (0, 0)], [(0, 0), (6, 7), (0, 3), (3, 6), (0, 2), (2, 4), (4, 6), (6, 7), (7, 9), (0, 0)], [(0, 0), (0, 2), (2, 4), (4, 6), (6, 7), (7, 9), (9, 10), (0, 2), (0, 2), (0, 0)], [(0, 0), (6, 7), (7, 9), (9, 10), (0, 2), (0, 2), (2, 5), (5, 6), (0, 2), (0, 0)], [(0, 0), (0, 2), (0, 2), (2, 5), (5, 6), (0, 2), (0, 3), (3, 6), (0, 3), (0, 0)], [(0, 0), (5, 6), (0, 2), (0, 3), (3, 6), (0, 3), (3, 5), (0, 3), (3, 5), (0, 0)], [(0, 0), (3, 6), (0, 3), (3, 5), (0, 3), (3, 5), (5, 6), (6, 7), (0, 1), (0, 0)], [(0, 0), (0, 3), (3, 5), (5, 6), (6, 7), (0, 1), (1, 2), (0, 1), (1, 4), (0, 0)], [(0, 0), (6, 7), (0, 1), (1, 2), (0, 1), (1, 4), (0, 2), (0, 4), (4, 5), (0, 0)], [(0, 0), (0, 1), (1, 4), (0, 2), (0, 4), (4, 5), (0, 2), (2, 6), (6, 7), (0, 0)], [(0, 0), (0, 4), (4, 5), (0, 2), (2, 6), (6, 7), (0, 2), (2, 4), (0, 2), (0, 0)], [(0, 0), (2, 6), (6, 7), (0, 2), (2, 4), (0, 2), (2, 6), (6, 7), (0, 2), (0, 0)], [(0, 0), (2, 4), (0, 2), (2, 6), (6, 7), (0, 2), (2, 4), (4, 6), (6, 8), (0, 0)], [(0, 0), (6, 7), (0, 2), (2, 4), (4, 6), (6, 8), (8, 12), (0, 2), (2, 5), (0, 0)], [(0, 0), (4, 6), (6, 8), (8, 12), (0, 2), (2, 5), (5, 7), (0, 2), (2, 5), (0, 0)], [(0, 0), (0, 2), (2, 
5), (5, 7), (0, 2), (2, 5), (5, 7), (0, 2), (2, 4), (0, 0)], [(0, 0), (0, 2), (2, 5), (5, 7), (0, 2), (2, 4), (0, 2), (0, 3), (3, 6), (0, 0)], [(0, 0), (0, 2), (2, 4), (0, 2), (0, 3), (3, 6), (6, 7), (0, 2), (0, 1), (0, 0)], [(0, 0), (0, 3), (3, 6), (6, 7), (0, 2), (0, 1), (1, 2), (0, 3), (3, 6), (0, 0)], [(0, 0), (0, 2), (0, 1), (1, 2), (0, 3), (3, 6), (6, 7), (0, 3), (3, 5), (0, 0)], [(0, 0), (0, 3), (3, 6), (6, 7), (0, 3), (3, 5), (5, 7), (7, 9), (9, 10), (0, 0)], [(0, 0), (0, 3), (3, 5), (5, 7), (7, 9), (9, 10), (0, 2), (2, 4), (0, 3), (0, 0)], [(0, 0), (7, 9), (9, 10), (0, 2), (2, 4), (0, 3), (3, 4), (0, 1), (1, 3), (0, 0)], [(0, 0), (2, 4), (0, 3), (3, 4), (0, 1), (1, 3), (3, 5), (0, 2), (2, 5), (0, 0)], [(0, 0), (0, 1), (1, 3), (3, 5), (0, 2), (2, 5), (0, 2), (0, 5), (5, 8), (0, 0)], [(0, 0), (0, 2), (2, 5), (0, 2), (0, 5), (5, 8), (8, 11), (11, 13), (0, 2), (0, 0)], [(0, 0), (0, 5), (5, 8), (8, 11), (11, 13), (0, 2), (0, 2), (2, 4), (4, 7), (0, 0)], [(0, 0), (11, 13), (0, 2), (0, 2), (2, 4), (4, 7), (7, 9), (0, 1), (1, 5), (0, 0)], [(0, 0), (2, 4), (4, 7), (7, 9), (0, 1), (1, 5), (0, 2), (2, 4), (0, 2), (0, 0)], [(0, 0), (0, 1), (1, 5), (0, 2), (2, 4), (0, 2), (2, 3), (3, 6), (0, 3), (0, 0)], [(0, 0), (2, 4), (0, 2), (2, 3), (3, 6), (0, 3), (3, 6), (0, 2), (0, 2), (0, 0)], [(0, 0), (3, 6), (0, 3), (3, 6), (0, 2), (0, 2), (2, 4), (4, 6), (0, 1), (0, 0)], [(0, 0), (0, 2), (0, 2), (2, 4), (4, 6), (0, 1), (1, 3), (3, 5), (0, 3), (0, 0)], [(0, 0), (4, 6), (0, 1), (1, 3), (3, 5), (0, 3), (3, 5), (5, 8), (8, 9), (0, 0)], [(0, 0), (3, 5), (0, 3), (3, 5), (5, 8), (8, 9), (0, 2), (2, 5), (5, 7), (0, 0)], [(0, 0), (5, 8), (8, 9), (0, 2), (2, 5), (5, 7), (7, 9), (0, 3), (3, 4), (0, 0)], [(0, 0), (2, 5), (5, 7), (7, 9), (0, 3), (3, 4), (0, 2), (2, 4), (4, 5), (0, 0)], [(0, 0), (0, 3), (3, 4), (0, 2), (2, 4), (4, 5), (5, 7), (7, 8), (0, 2), (0, 0)], [(0, 0), (2, 4), (4, 5), (5, 7), (7, 8), (0, 2), (2, 4), (4, 6), (6, 9), (0, 0)], [(0, 0), (7, 8), (0, 2), (2, 4), (4, 6), 
(6, 9), (0, 3), (0, 3), (3, 7), (0, 0)], [(0, 0), (4, 6), (6, 9), (0, 3), (0, 3), (3, 7), (7, 8), (8, 9), (0, 3), (0, 0)], [(0, 0), (0, 3), (3, 7), (7, 8), (8, 9), (0, 3), (3, 4), (0, 2), (0, 3), (0, 0)], [(0, 0), (8, 9), (0, 3), (3, 4), (0, 2), (0, 3), (3, 5), (0, 2), (2, 3), (0, 0)], [(0, 0), (0, 2), (0, 3), (3, 5), (0, 2), (2, 3), (0, 1), (1, 4), (4, 7), (0, 0)], [(0, 0), (0, 2), (2, 3), (0, 1), (1, 4), (4, 7), (0, 2), (2, 4), (4, 7), (0, 0)], [(0, 0), (1, 4), (4, 7), (0, 2), (2, 4), (4, 7), (7, 8), (0, 3), (3, 6), (0, 0)], [(0, 0), (2, 4), (4, 7), (7, 8), (0, 3), (3, 6), (0, 1), (1, 4), (0, 2), (0, 0)], [(0, 0), (0, 3), (3, 6), (0, 1), (1, 4), (0, 2), (0, 3), (0, 2), (2, 5), (0, 0)], [(0, 0), (1, 4), (0, 2), (0, 3), (0, 2), (2, 5), (5, 7), (7, 8), (0, 0)]]
@djstrong what is confusing? You are giving a list of strings (.split()), so the offsets are given relative to your already-split input. If you want to know which original string a token belongs to, you need to go through your offsets in order: whenever start < last_stop, you are looking at a new token.
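The start < last_stop rule can be sketched in plain Python (an illustration, not library code):

```python
# Recover a word index per token from pretokenized offsets: a new word
# begins whenever an offset restarts before the previous one ended.
def word_indices(offsets):
    indices, word, last_stop = [], -1, 0
    for start, stop in offsets:
        if start == 0 and stop == 0:
            indices.append(None)  # special token
            continue
        if word == -1 or start < last_stop:
            word += 1
        indices.append(word)
        last_stop = stop
    return indices

# Offsets from the example above ('drieme drieme'):
print(word_indices([(0, 0), (0, 1), (1, 6), (1, 1), (1, 6), (0, 0)]))
# → [None, 0, 0, 1, 1, None]
```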
It is a bit contrived, but the library isn't intended to be used that way; the splitting should be handled automatically by the pre_tokenizer. What would you expect to receive instead?
It is more tricky when using max_length and stride. Here is my solution for mapping subtokens to tokens:
tokens = text.split()  # the original pretokenized words
token_index = -1
for offset_mappings, input_ids in zip(tokenized_tokens['offset_mapping'], tokenized_tokens['input_ids']):
    print(offset_mappings)
    tmp = []
    for (start, end), input_id in zip(offset_mappings, input_ids):
        if start == 0 and end == 0:
            continue  # special token
        if start == 0:
            token_index += 1  # a new word starts here
        tmp.append(token_index)
        print(start, end, token_index, tokens[token_index][start:end], tokenizer.convert_ids_to_tokens([input_id]))
    token_index = tmp[-stride - 1]  # rewind past the stride overlap before the next window
    print()
The third column is the token_index in the original text:
[(0, 0), (0, 2), (2, 5), (0, 1), (1, 4), (4, 5), (0, 2), (2, 5), (0, 2), (0, 0)]
0 2 0 Lo ['Lo']
2 5 0 rem ['rem</w>']
0 1 1 i ['i']
1 4 1 psu ['psu']
4 5 1 m ['m</w>']
0 2 2 do ['do']
2 5 2 lor ['lor</w>']
0 2 3 si ['si']
[(0, 0), (1, 4), (4, 5), (0, 2), (2, 5), (0, 2), (2, 3), (0, 1), (1, 4), (0, 0)]
1 4 1 psu ['psu']
4 5 1 m ['m</w>']
0 2 2 do ['do']
2 5 2 lor ['lor</w>']
0 2 3 si ['si']
2 3 3 t ['t</w>']
0 1 4 a ['a']
1 4 4 met ['met</w>']
...
My use case is simple TokenClassification for long texts. I need to map subtokens to tokens to assign labels. I would like to also have token_index beside start and stop for each subtoken.
It is a bit contrived, but the library isn't intended to be used that way, the splitting should be handled automatically by the pre_tokenizer.
What do you mean? text.split()? Text is usually pretokenized in TokenClassification tasks.
No it isn't. For token-classification you can try using the pipeline directly which should work on any model.
from transformers import pipeline
pipe = pipeline(model="roberta-base")
pipe("Lorem ipsum dolor sit amet, consectetur adipiscing elit")
# [[{'entity': 'LABEL_1',
# 'score': 0.6173451,
# 'index': 1,
# 'word': 'L',
# 'start': 0,
# 'end': 1}, ..........]
It doesn't work on long text with striding right now, but there are open issues for it and if you're looking to make a contribution there, it would be greatly appreciated. The biggest concern would be conflict resolution when elements of the stride don't agree.
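To make the conflict-resolution concern concrete, one hypothetical scheme is a majority vote over the overlapping positions of the stride windows (resolve_stride_predictions and its input format are made up here purely for illustration):

```python
from collections import Counter, defaultdict

# Each stride window predicts one label per token position it covers;
# positions covered by several windows are resolved by majority vote.
def resolve_stride_predictions(window_preds):
    # window_preds: list of (start_position, [labels...]) pairs
    votes = defaultdict(Counter)
    for start, labels in window_preds:
        for i, label in enumerate(labels):
            votes[start + i][label] += 1
    return [votes[pos].most_common(1)[0][0] for pos in sorted(votes)]

# Two windows overlapping at positions 2-3:
print(resolve_stride_predictions([(0, ["O", "B-LOC", "I-LOC", "O"]),
                                  (2, ["I-LOC", "O", "O"])]))
# → ['O', 'B-LOC', 'I-LOC', 'O', 'O']
```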
It is more tricky using max_length and stride. Here is the solution for mapping subtokens to tokens:
You can use striding with the tokenizer on its own.
tokenizer("Lorem ipsum", max_length=4, stride=1, return_overflowing_tokens=True, return_offsets_mapping=True)
# {'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
# 'input_ids': [[0, 574, 43375, 2],
# [0, 43375, 1437, 2],
# [0, 1437, 7418, 2],
# [0, 7418, 783, 2]],
# 'offset_mapping': [[(0, 0), (0, 1), (1, 5), (0, 0)],
# [(0, 0), (1, 5), (6, 6), (0, 0)],
# [(0, 0), (6, 6), (6, 9), (0, 0)],
# [(0, 0), (6, 9), (9, 11), (0, 0)]],
# 'overflow_to_sample_mapping': [0, 0, 0, 0]}
As you can see, all offsets are linked to the original string; you don't have to send pretokenized input. This is the intended way. (That is why I mentioned the .split() was contrived.)
My use case is simple TokenClassification for long texts. I need to map subtokens to tokens to assign labels. I would like to also have token_index beside start and stop for each subtoken.
Sorry to insist, but I must be frank: there are no subtokens vs. tokens. Something is a token or it is not. A token is the integer number (usually called an id) associated with some portion of the original text (portions can be zero-width, and they can overlap). It is, by definition, not cuttable into new pieces.
What you are probably thinking of is "words" vs. "tokens", with "tokens" being pieces of "words". This is a misconception. First of all, not all tokenizers even have the concept of a "word"; some treat the incoming string as a whole and chunk it without ever looking at "words". Then, "word" is actually quite ill-defined when examined with sufficient scrutiny (as is any concept, really, but I digress). What people call "words" are usually whitespace-separated portions of text, and that is just a way to "pretokenize" your text, to help the tokenizer find boundaries that make sense for your language. There are many languages where whitespace isn't used; Chinese is a great example. Even in English, whitespace is not necessarily enough ("new-york", "hello!"), and German can concatenate "words" to make new ones.
So whitespace splitting is just a way to force tokenizers to limit their tokens so that they always stop at a space boundary. This is a bias fed into the tokenizer (and a requirement for speed in BPE, for instance), and we internally call this bias "pre_tokenizers" (since they usually work by creating artificial boundaries, hence pre-cutting the original text into pieces that aren't tokens yet).
Sorry for the rant; I don't mean to be rude. This misconception is extremely common, so I try to clarify it at every opportunity I get.
I know the difference, and IMHO it is not a misconception but a different set of definitions for words/tokens/subtokens. Whitespace splitting was just a simple way to create a pretokenized example. For me, tokens are fragments of text (usually words and punctuation marks), and those tokens can be further tokenized into subtokens (integer numbers). In TokenClassification we usually provide pretokenized texts, especially during training; I don't know why you disagree with that. Look at the datasets (most of them are pretokenized): https://huggingface.co/datasets/conll2003
So I have to, or want to, provide pretokenized text.
And can you do the following?
all_ids = []
offsets = []
for (token_id, token) in enumerate(sentence.split()):
    encoded = tokenizer(token)
    all_ids.append(encoded.ids)
    for (start, stop) in encoded.offsets:
        offsets.append((token_id, start, stop))
I know the difference, and IMHO it is not a misconception but a different set of definitions for words/tokens/subtokens. Whitespace splitting was just a simple way to create a pretokenized example. For me, tokens are fragments of text (usually words and punctuation marks), and those tokens can be further tokenized into subtokens (integer numbers).
Ok, in this library's vocabulary, then, what you call a token would be a sentence (long or not, both are seen as a single string), and what you call a subtoken would be a token.
In TokenClassification we usually provide pretokenized texts, especially during training. I don't know why you disagree with that. Look at the datasets (most of them are pretokenized): https://huggingface.co/datasets/conll2003
I know, but I maintain that these are pretty powerful biases being introduced. Anyway, from this library's perspective, every bias should be introduced through a pre_tokenizer if at all possible, by using something like tokenizer.pre_tokenizer = pre_tokenizers.Sequence([pre_tokenizers.Whitespace(), pre_tokenizers.Punctuation()]) and then using the tokenizer directly on the full sentence. There's also the custom one that could be used:
class CamelCasePretok:
    def get_state(self, c):
        if c.islower():
            return "lower"
        elif c.isupper():
            return "upper"
        elif c.isdigit():
            return "digit"
        else:
            return "rest"

    def split(self, n, normalized):
        i = 0
        # states = {"any", "lower", "upper", "digit", "rest"}
        state = "any"
        pieces = []
        for j, c in enumerate(normalized.normalized):
            c_state = self.get_state(c)
            if state == "any":
                state = c_state
            if state != "rest" and state == c_state:
                pass
            elif state == "upper" and c_state == "lower":
                pass
            else:
                pieces.append(normalized[i:j])
                i = j
            state = c_state
        pieces.append(normalized[i:])
        return pieces

    def pre_tokenize(self, pretok):
        pretok.split(self.split)

tokenizer.pre_tokenizer = PreTokenizer.custom(CamelCasePretok())
# Using the tokenizer on "ThisIsATest" should yield something like
# ["This", "Is", "A", "Test"] (with maybe further splitting)
Please note that custom pre-tokenizers cannot be saved in the .json file, so they need to be applied manually on each load.
Thank you. I can tokenize each "my token" separately, but one call to the tokenizer should be faster, and I would have to implement max_length and stride myself - this is the way I am doing it right now, but I thought the tokenizer would do it for me, faster.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.