char_to_token is broken when is_split_into_words is set to True


Hi,

I am using LongformerTokenizerFast and char_to_token is not working properly when I set is_split_into_words to True.

Here is code to reproduce the issue:

import transformers
from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', add_prefix_space=True)

print(f'transformers version: {transformers.__version__}\n')
s = "one two three"
tokenized = tokenizer(text=s.split(' '), 
                      is_split_into_words=True,
                      add_special_tokens=False)

print(f'input: \"{s}\"')
print(f'input character len: {len(s)}')
print(f"input tokens: {tokenizer.convert_ids_to_tokens(tokenized['input_ids'])}\n")

for i in range(len(s)):
  print(f'character #{i} is mapped to {tokenized.char_to_token(i)}')

Output is:

transformers version: 4.18.0.dev0

input: "one two three"
input character len: 13
input tokens: ['Ġone', 'Ġtwo', 'Ġthree']

character #0 is mapped to 0
character #1 is mapped to 0
character #2 is mapped to 0
character #3 is mapped to 2
character #4 is mapped to 2
character #5 is mapped to None
character #6 is mapped to None
character #7 is mapped to None
character #8 is mapped to None
character #9 is mapped to None
character #10 is mapped to None
character #11 is mapped to None
character #12 is mapped to None

zorikg avatar Mar 09 '22 20:03 zorikg

Hi @zorikg ,

I will look into it in a bit more detail, but is there any reason you pre-split your input here? It seems like tokenizer(s) should do exactly what you want and be correct (the Longformer tokenizer will split on spaces itself), no?

is_split_into_words=True will almost always induce errors, since at the very least offsets cannot be computed correctly (we don't know what the original string was).
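For comparison, a rough sketch of the un-split path I mean (same checkpoint and string as in your snippet; the expected mapping in the trailing comment is my assumption, not verified output):

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', add_prefix_space=True)

# Pass the raw string, so character indices refer to the original text
# rather than to each individual word.
s = "one two three"
tokenized = tokenizer(text=s, add_special_tokens=False)

for i, c in enumerate(s):
    print(f'character #{i} ({c!r}) is mapped to {tokenized.char_to_token(i)}')
# Expected: characters of "one" -> token 0, "two" -> token 1, "three" -> token 2;
# the spaces themselves will typically map to None.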

Narsil avatar Mar 10 '22 07:03 Narsil

Ok, I looked into it, and it seems you just need to actually pass sequence_index to your char_to_token call.

for sequence_index, split in enumerate(s.split(" ")):
    for char_index, c in enumerate(split):
        print(f'character #{char_index} is mapped to {tokenized.char_to_token(char_index, sequence_index=sequence_index)}')

Output is:

character #0 is mapped to 0
character #1 is mapped to 0
character #2 is mapped to 0
character #0 is mapped to 0
character #1 is mapped to 0
character #2 is mapped to 0
character #0 is mapped to 0
character #1 is mapped to 0
character #2 is mapped to 0
character #3 is mapped to 2
character #4 is mapped to 2

If you want to understand mappings between your original string and the actual tokens you get, I recommend using offsets instead.

tokenized = tokenizer(text=s, add_special_tokens=False, return_offsets_mapping=True)
for input_id, offsets in zip(tokenized["input_ids"], tokenized["offset_mapping"]):
    start, stop = offsets
    print(f"Token {input_id} came from {s[start:stop]!r}")

That way you can recover even ids that are not mapped anywhere in the string (zero-width tokens, if you will; they are usually special tokens). You can also see gaps that were ignored in the original text, and it prevents any normalization from getting in your way, since you are really looking at the original string.
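As a small sketch of the zero-width point (same tokenizer and string as above, but keeping the special tokens; the (0, 0) offsets are the usual behavior, not something I re-checked here):

tokenized = tokenizer(text=s, add_special_tokens=True, return_offsets_mapping=True)
for input_id, (start, stop) in zip(tokenized["input_ids"], tokenized["offset_mapping"]):
    token = tokenizer.convert_ids_to_tokens(input_id)
    # special tokens added by the post-processor usually carry (0, 0) offsets,
    # i.e. they map to no span of the original string
    print(f"{token!r} -> offsets ({start}, {stop}) -> {s[start:stop]!r}")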

char_to_token is basically a reverse lookup for this, and in my own usage I always seem to find using offsets to be more natural.

Narsil avatar Mar 10 '22 07:03 Narsil

Thanks @Narsil for looking into this!

Please note that in your example the characters of the word "two" seem to be mapped to the token at index 0 (instead of 1). From the documentation I am not sure that sequence_index is the solution (it says it is only useful when we have a pair of sequences).

As for your first comment, I do need the is_split_into_words=True functionality for my use case; I just tried to provide a simple example in this discussion. I know that we can't recover the original string, but it would still be nice if I could somehow map the characters of each word (after the split) to the relevant token. Do you have any other ideas for how we can achieve that?

P.S. I also wanted to use offset_mapping in the beginning, but found its behavior strange as well when is_split_into_words=True; for example, in our case we get [(0, 3), (0, 3), (0, 5)].
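That said, if those offsets are indeed relative to each word, a rough sketch of the mapping I am after could combine them with word_ids() (a method on fast-tokenizer outputs); I have not verified that this is the intended usage:

words = s.split(' ')
tokenized = tokenizer(text=words, is_split_into_words=True,
                      add_special_tokens=False, return_offsets_mapping=True)

# word_ids() says which word each token came from; offset_mapping gives the
# character span of the token *within* that word (per the offsets above).
char_to_tok = {}  # (word_index, char_index_within_word) -> token_index
for token_index, (word_index, (start, stop)) in enumerate(
        zip(tokenized.word_ids(), tokenized["offset_mapping"])):
    if word_index is None:  # special tokens, if any
        continue
    for char_index in range(start, stop):
        char_to_tok[(word_index, char_index)] = token_index

print(char_to_tok.get((1, 0)))  # first character of "two" -> expected token 1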

zorikg avatar Mar 10 '22 21:03 zorikg

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 21 '24 01:02 github-actions[bot]