ByT5Tokenizer ignores spaces around added tokens

Open djstrong opened this issue 2 years ago • 5 comments

System Info

transformers 4.23.1

Who can help?

@patrickvonplaten @SaulLu

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/byt5-base')
tokenizer.add_tokens('<x>', special_tokens=True)
print(tokenizer('<x> <x> <x><x>'))
# {'input_ids': [384, 384, 384, 384, 1], 'attention_mask': [1, 1, 1, 1, 1]}

in comparison to:

print(tokenizer('a a aa'))
# {'input_ids': [100, 35, 100, 35, 100, 100, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Expected behavior

In my task, the presence of spaces around added tokens is important. Regardless of that, I think the ByT5 tokenizer should not ignore any characters (bytes).
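
For reference, following the byte-to-id pattern visible in the 'a a aa' example (each byte maps to its value plus an offset of 3, so a space, byte 32, becomes id 35), the expected output would presumably be:

print(tokenizer('<x> <x> <x><x>'))
# Expected (not actual) output, inferred from the pattern above:
# {'input_ids': [384, 35, 384, 35, 384, 384, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}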

djstrong avatar Oct 25 '22 15:10 djstrong

May also be of interest to @ArthurZucker

sgugger avatar Oct 25 '22 16:10 sgugger

Also cc @Narsil - any ideas here?

patrickvonplaten avatar Nov 30 '22 10:11 patrickvonplaten

Yes, by default added tokens always use lstrip/rstrip=True, which swallows prefix/suffix spaces (it's a convenience so you don't have to worry about how the token fits within surrounding text). Since ByT5 is pure bytes, it doesn't have tokenizers support (it doesn't make sense speed-wise) and uses the "slow" class (it's not actually slow, though).

from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
# tokenizer.add_tokens("<x>", special_tokens=True)  # original call: strips surrounding spaces
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)
# Workaround: also register the AddedToken object on the internal list,
# so the Trie sees the lstrip/rstrip=False flags.
tokenizer._additional_special_tokens.append(new_token)

This change will fix it; however, it requires changing internals, which is not great. Definitely looks like a bug.

Pinging @ydshieh, who was looking at this recently and trying to figure out some tokenizer stuff.

I "think" this qualifies as a bug. (Well the original shared code is not OK, the defaults are to strip left and right, but if you do add_tokens(AddedToken(.., lstrip=False, rstrip=False)) then it should honor that. And the workaround I had to look at a few different internal variables to set it appropriately so that the Trie class could do it's job correctly (otherwise it just couldn't see the AddedToken values.

Narsil avatar Nov 30 '22 14:11 Narsil

Sorry for being late here. So as @Narsil pointed out,

from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)

should work (which is not the case for now) without the extra tokenizer._additional_special_tokens.append(new_token) step. The goal is to make the above code snippet do its job correctly. Is this right?

ydshieh avatar Feb 13 '23 14:02 ydshieh

Hey! I'll take this one on as part of #23909, since it is an issue with rstrip and lstrip being ignored (the default behaviour, if a token is not special, is to always strip).
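
Once that refactor lands, the intended behaviour would be something like the sketch below (this assumes a transformers release that includes #23909; the added_tokens_decoder inspection and the expected ids are illustrative, not verified against a specific version):

from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
tokenizer.add_tokens(AddedToken("<x>", lstrip=False, rstrip=False), special_tokens=True)

# The stored token should keep lstrip=False/rstrip=False ...
print(tokenizer.added_tokens_decoder)

# ... so the surrounding space bytes should survive encoding:
print(tokenizer("<x> <x> <x><x>"))
# Expected: {'input_ids': [384, 35, 384, 35, 384, 384, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}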

ArthurZucker avatar Jun 01 '23 15:06 ArthurZucker

As mentioned, this will take a bit more time, a big refactoring is coming! 🔥

ArthurZucker avatar Jun 27 '23 12:06 ArthurZucker

Should be merged this week!

ArthurZucker avatar Aug 16 '23 08:08 ArthurZucker