ByT5Tokenizer ignores spaces around added tokens
System Info
transformers 4.23.1
Who can help?
@patrickvonplaten @SaulLu
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/byt5-base')
tokenizer.add_tokens('<x>', special_tokens=True)
print(tokenizer('<x> <x> <x><x>'))
```

```
{'input_ids': [384, 384, 384, 384, 1], 'attention_mask': [1, 1, 1, 1, 1]}
```
in comparison to:

```python
print(tokenizer('a a aa'))
```

```
{'input_ids': [100, 35, 100, 35, 100, 100, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
```
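For context on the ids: ByT5 tokenizes raw UTF-8 bytes and shifts each byte by 3 to make room for the pad/eos/unk tokens, so 'a' (97) becomes 100 and the space (32) becomes 35:

```python
# ByT5 maps each UTF-8 byte b to id b + 3 (ids 0-2 are reserved for
# pad/eos/unk), so 'a' (97) -> 100 and ' ' (32) -> 35.
print([b + 3 for b in 'a a aa'.encode('utf-8')])
# [100, 35, 100, 35, 100, 100]
```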
Expected behavior
In my task, the presence of spaces around added tokens is important. Regardless, I think the ByT5 tokenizer should not ignore any characters (bytes).
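Concretely, assuming `<x>` keeps id 384 as above, I would expect the first snippet to produce:

```
{'input_ids': [384, 35, 384, 35, 384, 384, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
```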
May also be of interest to @ArthurZucker
Also cc @Narsil - any ideas here?
Yes, by default added tokens always use `lstrip/rstrip=True`, which swallows prefix/suffix spaces (it's a convenience for most tokenizers). ByT5 doesn't have a "fast" tokenizer (it doesn't make sense speed-wise), so this uses the "slow" class (it's not slow, though):
```python
from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
# tokenizer.add_tokens("<x>", special_tokens=True)  # the original call; strips spaces
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)
# Also register the token internally so the tokenizer honors its
# lstrip/rstrip settings when splitting the input.
tokenizer._additional_special_tokens.append(new_token)
```
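For reference, a quick check after this workaround (assuming `<x>` is assigned id 384 as in the report) should now keep the space byte 35 between the added tokens:

```python
print(tokenizer('<x> <x> <x><x>'))
# {'input_ids': [384, 35, 384, 35, 384, 384, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
```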
This change will fix it; however, it requires changing internals, which is not great. It definitely looks like a bug.
Pinging @ydshieh, who was looking at this recently and trying to figure out some tokenizer stuff.
I "think" this qualifies as a bug. (Well the original shared code is not OK, the defaults are to strip left and right, but if you do add_tokens(AddedToken(.., lstrip=False, rstrip=False))
then it should honor that. And the workaround I had to look at a few different internal variables to set it appropriately so that the Trie
class could do it's job correctly (otherwise it just couldn't see the AddedToken
values.
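For illustration, the splitting is done by the internal `transformers.tokenization_utils.Trie` (a private API that may change); a minimal sketch of what it does:

```python
from transformers.tokenization_utils import Trie

# The Trie splits the raw text around the added tokens it knows about;
# the pieces in between are then byte-tokenized as usual. If a token
# was never registered, the Trie simply never matches it.
trie = Trie()
trie.add("<x>")
print(trie.split("<x> <x> <x><x>"))
# ['<x>', ' ', '<x>', ' ', '<x>', '<x>']
```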
Sorry for being late here. So, as @Narsil pointed out,

```python
from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)
```

should work (which is not the case for now) without needing `tokenizer._additional_special_tokens.append(new_token)`. And the goal is to make the above code snippet do its job correctly. Is this right?
Hey! I'll take this one on as part of #23909, since it is an issue with `rstrip` and `lstrip` being ignored (the default behaviour when a token is not special is to always strip).
As mentioned, this will take a bit more time, a big refactoring is coming! 🔥
Should be merged this week!