
Mismatch between CLIP fast and non-fast tokenizers

Open xenova opened this issue 1 year ago • 11 comments

System Info

  • transformers version: 4.26.1
  • Platform: Windows-10-10.0.22621-SP0
  • Python version: 3.8.1
  • Huggingface_hub version: 0.11.0
  • PyTorch version (GPU?): 1.12.1+cu116 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker @younesbelkada @sgugger

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Here is a minimal working example to show the mismatch:

from transformers import AutoTokenizer

tokenizer_fast = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=True)
tokenizer = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=False)

text = "You should've done this"

print(tokenizer(text))
print(tokenizer_fast(text))
print(tokenizer(text) == tokenizer_fast(text))

# Outputs:
# {'input_ids': [49406, 592, 1535, 262, 563, 1700, 589, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
# {'input_ids': [49406, 592, 1535, 1200, 1700, 589, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
# False

It appears to stem from the 've token.
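For reference, one way to see which tokens are involved is to map the differing ids back to strings (a minimal sketch using the standard convert_ids_to_tokens API; the ids are taken from the outputs above):

# The slow output contains 262, 563 where the fast output has 1200;
# converting the ids back to tokens shows how each tokenizer split "'ve".
print(tokenizer.convert_ids_to_tokens([262, 563]))
print(tokenizer_fast.convert_ids_to_tokens([1200]))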

Expected behavior

The non-fast tokenization should match the fast tokenization (https://github.com/huggingface/tokenizers)

xenova avatar Mar 14 '23 21:03 xenova

cc @ArthurZucker

younesbelkada avatar Mar 15 '23 07:03 younesbelkada

Note: If you would like to get matching tokenization before a fix goes in, installing ftfy first should do it.
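For a quick sanity check of which code path you are on, here is a minimal sketch; the slow CLIP tokenizer behaves differently depending on whether ftfy is importable:

# Minimal check of whether ftfy is importable in the current environment.
# The slow CLIP tokenizer only matches the fast one when it is.
try:
    import ftfy  # noqa: F401
    print("ftfy installed: slow and fast outputs should match")
except ImportError:
    print("ftfy not installed: the slow tokenizer falls back to BasicTokenizer")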

I initially looked at fixing this specific issue around apostrophes, but it became apparent that there were other potential formatting inconsistencies. For example, running the code below also shows differences in things like 'll and !! tokenization:

from transformers import AutoTokenizer

tokenizer_fast = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=True)
tokenizer = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16', use_fast=False)

text = "A\n'll !!to?'d''d of, can't."

print(tokenizer(text))
print(tokenizer_fast(text))
print(tokenizer(text) == tokenizer_fast(text))

# Outputs:
# {'input_ids': [49406, 320, 262, 865, 256, 256, 531, 286, 262, 323, 262, 262, 323, 539, 267, 753, 262, 339, 269, 49407], ...}
# {'input_ids': [49406, 320, 1342, 748, 531, 13610, 323, 8445, 323, 539, 267, 753, 713, 269, 49407], ...}
# False

I put up a first attempt at fixing this. It's ready for review, but not for merging until we pick which change to apply more broadly.

connor-henderson avatar Mar 20 '23 23:03 connor-henderson

Hey! I can't reproduce the issue; when I ran your code I got:

{'input_ids': [49406, 592, 1535, 1200, 1700, 589, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [49406, 592, 1535, 1200, 1700, 589, 49407], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

ArthurZucker avatar Mar 21 '23 08:03 ArthurZucker

I just updated to transformers 4.27.2, but it still produces the incorrect output. Are you sure you are running the non-fast tokenizer, @ArthurZucker?

As stated above by @connor-henderson, it's probably because you have ftfy installed, which I assume will use its own basic tokenizer.

xenova avatar Mar 21 '23 11:03 xenova

Hey Arthur and xenova, in my case uninstalling ftfy or commenting out these import lines reproduces the issue, I believe because this conditional determines whether the BasicTokenizer is used for CLIPTokenizer.
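For illustration, a minimal sketch of that fallback (assuming BasicTokenizer imported from the BERT tokenization module, which is where it is defined):

from transformers.models.bert.tokenization_bert import BasicTokenizer

# BasicTokenizer splits on punctuation by default, so the apostrophe in
# "should've" is separated out, matching the extra ids in the slow output.
basic = BasicTokenizer(do_lower_case=True)
print(basic.tokenize("You should've done this"))
# e.g. ['you', 'should', "'", 've', 'done', 'this']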

connor-henderson avatar Mar 21 '23 12:03 connor-henderson

I made sure that I was using fast, yes 😉 Though it is our goal to have the same output from fast and non-fast tokenizers, I don't really know why ftfy is used here. But yes, it is most probably not used in the fast tokenization. This also means that the expected behaviour should probably be the one that normalizes the text with ftfy. That is something that is going to be hard to port to tokenizers, depending on what kind of normalization is going on.

ArthurZucker avatar Mar 21 '23 14:03 ArthurZucker

@ArthurZucker I believe it is the opposite, the mismatch happens when ftfy is not installed. (@connor-henderson correct me if I misunderstood your posts).

sgugger avatar Mar 21 '23 14:03 sgugger

@ArthurZucker I believe it is the opposite, the mismatch happens when ftfy is not installed. (@connor-henderson correct me if I misunderstood your posts).

Yes, this is correct. I don't have ftfy installed, and I get the mismatch.

xenova avatar Mar 21 '23 14:03 xenova

@sgugger yes thanks that is what I was saying.

I think this comes down to the expected behavior when using the BasicTokenizer generally. If it is supposed to match the fast tokenizer output, I believe we have a bug. But if it's not, and it's just expected to split naively on punctuation, then I don't think we have a bug and I should close my PR.

connor-henderson avatar Mar 21 '23 15:03 connor-henderson

I think this PR is good for people who do not have ftfy! Thanks to both of you for pointing this out; I will be reviewing the PR!

ArthurZucker avatar Mar 27 '23 17:03 ArthurZucker

Commenting to mark as not stale :)

xenova avatar Apr 21 '23 15:04 xenova

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 10 '23 15:06 github-actions[bot]