CLIPTokenizer behaves inconsistently depending on whether ftfy is installed or not
System Info
- `transformers` version: 4.22.1
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.14
- Huggingface_hub version: 0.9.1
- PyTorch version (GPU?): 1.12.1+cu113 (False)
- Tensorflow version (GPU?): 2.8.2 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
@patil-suraj
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Run the following code without ftfy installed.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
tokenizer("résumé")  # {'input_ids': [49406, 15077, 49407], 'attention_mask': [1, 1, 1]}
```
- Run the same code with ftfy installed.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
tokenizer("résumé")  # {'input_ids': [49406, 29106, 7054, 4166, 49407], 'attention_mask': [1, 1, 1, 1, 1]}
```
Expected behavior
The tokenizer should produce the same output whether or not ftfy is installed.
This happens because `BasicTokenizer`, which is used as the fallback text-cleaning function when ftfy is unavailable, strips accents when `do_lower_case=True`.
We could fix this by explicitly setting `strip_accents=False`: the ViT-L/14 tokenizer's vocabulary includes tokens with accents, so stripping accents should not be done.
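For illustration, the accent stripping in question follows the standard Unicode recipe: decompose each character (NFD) and drop combining marks. A minimal, stdlib-only sketch of that behavior (the function name `strip_accents` here is illustrative, not the library's API):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose characters into base + combining marks (NFD),
    # then drop the combining marks (Unicode category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("résumé"))  # -> resume
```

This is why "résumé" tokenizes as if it were "resume" in the no-ftfy path: the accented forms never reach the vocabulary lookup.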
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.