[TBD] Discrepancy in the `tokenize` method behavior: should the returned token correspond to the token in the vocabulary or to the initial text?
Environment info
- `transformers` version: 4.17.0
- Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.12
- PyTorch version (GPU?): 1.10.0+cu111 (False)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Information
When the same token is added with the AddedToken class and the lstrip=True argument, the output of the tokenize method differs between a tokenizer with a slow backend and a tokenizer with a fast backend.
This difference should be put into perspective: the encoding (the sequence of ids) is identical in both cases, so the model still sees the correct input.
To reproduce
from transformers import AutoTokenizer, AddedToken

def print_tokenizer_result(text, tokenizer):
    tokens = tokenizer.tokenize(text)
    print(f"tokenize method: {tokens}")

# Save the hub tokenizer locally so that the slow and the fast tokenizers
# are loaded from exactly the same files.
model_name = "patrickvonplaten/norwegian-roberta-base"
tokenizer_init = AutoTokenizer.from_pretrained(model_name)
tokenizer_init.save_pretrained("local_tokenizer")

model_name = "local_tokenizer"
tokenizer_s = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer_f = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Add the same token to both tokenizers with lstrip=True
new_token = "added_token_lstrip_false"
tokenizer_s.add_tokens(AddedToken(new_token, lstrip=True))
tokenizer_f.add_tokens(AddedToken(new_token, lstrip=True))

text = "Example with added_token_lstrip_false"
print("Output for the fast:")
print_tokenizer_result(text, tokenizer_f)
print("\nOutput for the slow:")
print_tokenizer_result(text, tokenizer_s)
Output:
Output for the fast:
tokenize method: ['Ex', 'amp', 'le', 'Ġwith', ' added_token_lstrip_false']  # Note the space at the beginning of ' added_token_lstrip_false'
Output for the slow:
tokenize method: ['Ex', 'amp', 'le', 'Ġwith', 'added_token_lstrip_false']
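As mentioned in the Information section, the discrepancy only affects the string output of tokenize. A minimal check, run right after the reproduction script above (it reuses tokenizer_s, tokenizer_f and text from that script), confirms that both backends produce the same sequence of ids:

# The ids fed to the model are identical for both backends; only the
# string returned by `tokenize` for the added token differs.
ids_s = tokenizer_s.encode(text)
ids_f = tokenizer_f.encode(text)
print(ids_s == ids_f)  # expected: True, per the observation above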