
Load a pretrained fast tokenizer if fast=True and tokenizer.json exists

itazap opened this issue 1 year ago • 9 comments

Current status for AutoTokenizer with fast=True:

  1. checks tokenizer_config.json if tokenizer_class name ends with Fast
  2. if not, load a slow tokenizer

This PR:

  1. (unchanged) checks tokenizer_config.json if tokenizer_class name ends with Fast
  2. if not, check if the repo has a tokenizer.json file
     2.1 if yes, load PreTrainedTokenizerFast
  3. if not, load a slow tokenizer

A minimal sketch of this flow is shown below.
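
The sketch below illustrates the proposed resolution order only; it is not the actual implementation in tokenization_auto.py (which also consults the auto class-mapping dictionary). It uses get_tokenizer_config from tokenization_auto and file_exists from huggingface_hub:

    from huggingface_hub import file_exists
    from transformers import AutoTokenizer, PreTrainedTokenizerFast
    from transformers.models.auto.tokenization_auto import get_tokenizer_config

    def resolve_tokenizer(repo_id: str, use_fast: bool = True):
        config = get_tokenizer_config(repo_id)  # contents of tokenizer_config.json
        tokenizer_class = config.get("tokenizer_class") or ""

        # 1. tokenizer_config.json already names a *Fast class
        if use_fast and tokenizer_class.endswith("Fast"):
            return AutoTokenizer.from_pretrained(repo_id, use_fast=True)

        # 2. (this PR) no fast class named, but the repo ships a tokenizer.json
        if use_fast and file_exists(repo_id, "tokenizer.json"):
            return PreTrainedTokenizerFast.from_pretrained(repo_id)

        # 3. otherwise fall back to a slow tokenizer
        return AutoTokenizer.from_pretrained(repo_id, use_fast=False)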

prereq for https://github.com/huggingface/transformers/pull/29969

itazap avatar Sep 27 '24 08:09 itazap

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker SigLip will be the first model we can use to test this (the first model where we infer PreTrainedTokenizerFast without it being specified in the tokenizer_config.json). I can add the test to the SigLip PR https://github.com/huggingface/transformers/pull/29969 to avoid creating an internal model, wdyt?

Edit: added the test to the SigLip PR here: https://github.com/huggingface/transformers/pull/29969/files#r1784784751

itazap avatar Sep 30 '24 08:09 itazap

@ArthurZucker the test is here in the Siglip PR: https://github.com/huggingface/transformers/pull/29969/files#r1784784751 (copy pasted below)

        # Model does not have a fast tokenizer or PreTrainedTokenizerFast specified in config but can still load fast
        tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
        self.assertEqual(type(tokenizer), PreTrainedTokenizerFast)

No, this test cannot be added to this PR because Siglip is not merged yet; Siglip would be the first model that makes testing this possible. The test checks that when a fast tokenizer or PreTrainedTokenizerFast is not specified in the tokenizer config file, PreTrainedTokenizerFast can still be loaded.
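
For reference, the same check as a standalone script (assuming the Siglip PR is merged so the checkpoint is on the hub):

    from transformers import AutoTokenizer, PreTrainedTokenizerFast

    # google/siglip-base-patch16-224 ships a tokenizer.json but names no *Fast
    # class in its tokenizer_config.json, so the new fallback applies
    tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
    assert isinstance(tokenizer, PreTrainedTokenizerFast)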

To run this test within this PR we could create an internal siglip model, but I think that would be redundant since we intend to merge siglip soon.

itazap avatar Oct 05 '24 10:10 itazap

Okay, I mean we can still add a model on the hub that would not have a tokenizer_config.json -> it should still load a pretrained fast tokenizer

ArthurZucker avatar Oct 05 '24 12:10 ArthurZucker

Okay, it's just that IMO what we should do sounds simple: if we are not able to go through the normal route, but we have a tokenizer.json -> just load PreTrainedTokenizerFast.

ArthurZucker avatar Oct 05 '24 12:10 ArthurZucker

@ArthurZucker this feature requires that PreTrainedTokenizerFast is an option for a given model in this specific dictionary: https://github.com/huggingface/transformers/pull/29969/files#r1792386118

So if we create a private testing model on the hub without the PreTrainedTokenizerFast class specified in the config, we would still have to modify tokenization_auto to add PreTrainedTokenizerFast as the fast-class value. Unless we change this feature to always try Fast, but then it would never fall back to slow (which we currently do) if loading doesn't work, because the error would surface too deep in the loading process.
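
For context, that dictionary maps each model type to a (slow, fast) tokenizer class pair; an illustrative excerpt (not the exact file contents):

    from collections import OrderedDict

    # Simplified shape of TOKENIZER_MAPPING_NAMES in tokenization_auto.py:
    # model type -> (slow_tokenizer_class, fast_tokenizer_class)
    TOKENIZER_MAPPING_NAMES = OrderedDict(
        [
            # the fast slot must name PreTrainedTokenizerFast for the
            # inference described above to apply to this model
            ("siglip", ("SiglipTokenizer", "PreTrainedTokenizerFast")),
        ]
    )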

Also - I no longer have rights to create models for the HF org on the hub 😢

itazap avatar Oct 08 '24 19:10 itazap

Ah okay let me add you on the org!

ArthurZucker avatar Oct 10 '24 12:10 ArthurZucker

Added you back! 🤗

ArthurZucker avatar Oct 16 '24 12:10 ArthurZucker

  1. checks tokenizer_config.json if tokenizer_class name ends with Fast
  2. if not, check if repo has a tokenizer.json file
     2.1 if yes, load PreTrainedTokenizerFast
  3. if not, load a slow tokenizer

This is exactly what I want!

Unless we change this feature to always try Fast, but then it would never fall back to slow (which we currently do) if loading doesn't work, because the error would surface too deep in the loading process.

IMO let's just aim for simplicity: if we can't load anything (no tokenizer_config.json) then we just load the tokenizer.json, that's it, it's the only thing to add!
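
In code terms, a sketch of that fallback (PreTrainedTokenizerFast reads tokenizer.json directly; the repo id here is a placeholder):

    from transformers import PreTrainedTokenizerFast

    # With no usable tokenizer_config.json, load the repo's tokenizer.json directly
    tokenizer = PreTrainedTokenizerFast.from_pretrained("some-org/repo-with-tokenizer-json")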

Let's merge siglip, this one will follow!

ArthurZucker avatar Oct 16 '24 18:10 ArthurZucker