[TBD] Discrepancy in the `tokenize` method behavior: should the returned token correspond to the token in the vocabulary or to the initial text?
Environment info
- `transformers` version: 4.17.0
- Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.12
- PyTorch version (GPU?): 1.10.0+cu111 (False)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Information
When the same token is added with the AddedToken class and the lstrip=True argument, the output of the tokenize method differs between a tokenizer with a slow backend and a tokenizer with a fast backend.
This difference should be put into perspective: the encoding (the sequence of ids) is identical in both cases, so the model still sees the correct input.
To reproduce
from transformers import AutoTokenizer, AddedToken

def print_tokenizer_result(text, tokenizer):
    tokens = tokenizer.tokenize(text)
    print(f"tokenize method: {tokens}")

# Save the hub tokenizer locally so that the slow and the fast tokenizers
# are loaded from exactly the same files.
model_name = "patrickvonplaten/norwegian-roberta-base"
tokenizer_init = AutoTokenizer.from_pretrained(model_name)
tokenizer_init.save_pretrained("local_tokenizer")

model_name = "local_tokenizer"
tokenizer_s = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer_f = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Add the same token to both tokenizers with lstrip=True
new_token = "added_token_lstrip_false"
tokenizer_s.add_tokens(AddedToken(new_token, lstrip=True))
tokenizer_f.add_tokens(AddedToken(new_token, lstrip=True))

text = "Example with added_token_lstrip_false"
print("Output for the fast:")
print_tokenizer_result(text, tokenizer_f)
print("\nOutput for the slow:")
print_tokenizer_result(text, tokenizer_s)
Output:
Output for the fast:
tokenize method: ['Ex', 'amp', 'le', 'Ġwith', ' added_token_lstrip_false']  # Note the space at the beginning of ' added_token_lstrip_false'
Output for the slow:
tokenize method: ['Ex', 'amp', 'le', 'Ġwith', 'added_token_lstrip_false']
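As mentioned in the Information section, the discrepancy only affects the string output of tokenize. A minimal check, run right after the reproduction script above (it reuses tokenizer_s, tokenizer_f and text from that script), confirms that both backends produce the same sequence of ids:

# The ids fed to the model are identical for both backends; only the
# string returned by `tokenize` for the added token differs.
ids_s = tokenizer_s.encode(text)
ids_f = tokenizer_f.encode(text)
print(ids_s == ids_f)  # expected: True, per the observation above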