Mismatch between CLIP fast and non-fast tokenizers
System Info
- transformers version: 4.26.1
- Platform: Windows-10-10.0.22621-SP0
- Python version: 3.8.1
- Huggingface_hub version: 0.11.0
- PyTorch version (GPU?): 1.12.1+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
@ArthurZucker @younesbelkada @sgugger
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Here is a minimal working example to show the mismatch:
from transformers import AutoTokenizer
tokenizer_fast = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=True)
tokenizer = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=False)
text = "You should've done this"
print(tokenizer(text))
print(tokenizer_fast(text))
print(tokenizer(text) == tokenizer_fast(text))
# Outputs:
# {'input_ids': [49406, 592, 1535, 262, 563, 1700, 589, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
# {'input_ids': [49406, 592, 1535, 1200, 1700, 589, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
# False
It appears to stem from the 've token.
Expected behavior
The non-fast tokenization should match the fast tokenization (https://github.com/huggingface/tokenizers)
cc @ArthurZucker
Note: If you would like to get matching tokenization before a fix goes in, installing ftfy first should do it.
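For reference, here is a minimal sketch (standard library only, nothing CLIP-specific) to check whether ftfy is importable in your environment:

# Minimal sketch: check whether ftfy is importable. Per the discussion below,
# the slow CLIPTokenizer only falls back to BasicTokenizer-based splitting
# when ftfy is missing.
import importlib.util

if importlib.util.find_spec("ftfy") is None:
    print("ftfy not installed -> slow CLIPTokenizer uses its fallback path")
else:
    print("ftfy installed -> slow CLIPTokenizer cleans text with ftfy")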
I initially looked to fix this specific issue around apostrophes, but it became apparent there were other potential formatting inconsistencies. For example, running the code below also shows differences in the tokenization of things like 'll and !!:
from transformers import AutoTokenizer
tokenizer_fast = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=True)
tokenizer = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=False)
text = "A\n'll !!to?'d''d of, can't."
print(tokenizer(text))
print(tokenizer_fast(text))
print(tokenizer(text) == tokenizer_fast(text))
# Outputs:
# {'input_ids': [49406, 320, 262, 865, 256, 256, 531, 286, 262, 323, 262, 262, 323, 539, 267, 753, 262, 339, 269, 49407], ...}
# {'input_ids': [49406, 320, 1342, 748, 531, 13610, 323, 8445, 323, 539, 267, 753, 713, 269, 49407], ...}
# False
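To see exactly which pieces diverge, it helps to map the ids back to token strings; a minimal sketch using the same checkpoint (convert_ids_to_tokens is the only addition to the repro above):

# Minimal sketch: print the token strings from both tokenizers so the
# divergences around apostrophes and repeated punctuation are visible directly.
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=False)
fast = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=True)

text = "A\n'll !!to?'d''d of, can't."
print(slow.convert_ids_to_tokens(slow(text)['input_ids']))
print(fast.convert_ids_to_tokens(fast(text)['input_ids']))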
I put up a first go at fixing this. It's ready for review but not for merge until we pick which change to apply more broadly.
Hey! I can't reproduce the issue; when I ran your code I got:
{'input_ids': [49406, 592, 1535, 1200, 1700, 589, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [49406, 592, 1535, 1200, 1700, 589, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
I just updated to transformers-4.27.2, but it still produces the incorrect output. Are you sure you are running the non-fast tokenizer, @ArthurZucker?
As stated above by @connor-henderson, it's probably because you have ftfy installed, which I assume will use its own basic tokenizer.
Hey Arthur and xenova, in my case uninstalling ftfy or commenting out these import lines leads to a repro; I believe that's because this conditional determines whether the BasicTokenizer is used for CLIPTokenizer.
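To illustrate the two code paths, here is a rough sketch (not the exact logic in tokenization_clip.py; the BasicTokenizer arguments are illustrative):

# Rough sketch of the two slow-tokenizer cleanup paths (illustrative only).
from transformers.models.bert.tokenization_bert import BasicTokenizer

text = "You should've done this"

# Fallback path (no ftfy): a BERT-style BasicTokenizer pre-splits on
# punctuation, so "should've" becomes "should", "'", "ve" before BPE.
print(BasicTokenizer(do_lower_case=True).tokenize(text))

# ftfy path: the text is only unicode-fixed, so "should've" stays intact
# and BPE sees the same string the fast tokenizer sees.
try:
    import ftfy
    print(ftfy.fix_text(text))
except ImportError:
    print("ftfy not installed")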
I made sure that I was using fast yes 😉
Though it is our goal to have the same output from fast and non-fast, I don't really know why ftfy is used here. But yes, ftfy is most probably not used in the fast tokenization. This also means that the expected behaviour should probably be the one that normalizes the text with ftfy. This is something that is going to be hard to port to tokenizers, depending on what kind of normalization is going on.
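For what it's worth, the fast tokenizer's normalization chain can be inspected directly; a minimal sketch (assuming the same checkpoint as above):

# Minimal sketch: inspect the fast CLIP tokenizer's normalizer; per the
# discussion above, there is no ftfy step on the fast side.
from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=True)
print(fast.backend_tokenizer.normalizer)
print(fast.backend_tokenizer.normalizer.normalize_str("You should've done this"))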
@ArthurZucker I believe it is the opposite, the mismatch happens when ftfy is not installed. (@connor-henderson correct me if I misunderstood your posts).
Yes, this is correct. I don't have ftfy installed, and I get the mismatch.
@sgugger yes thanks that is what I was saying.
I think this comes down to the expected behavior when using the BasicTokenizer generally. If it is supposed to match the fast tokenizer output, I believe we have a bug. But if it's not, and it's just expected to split naively on punctuation, then I don't think we have a bug and I should close my PR.
I think this PR is good for people who do not have ftfy! Thanks both of you for pointing this out, and I will be reviewing the PR!
Commenting to mark as not stale :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.