Load a pretrained fast tokenizer if fast=True and tokenizer.json exists
Current status for AutoTokenizer with fast=True:
- checks tokenizer_config.json if tokenizer_class name ends with Fast
- if not, load a slow tokenizer

This PR (sketched in the example below):
1. (unchanged) checks tokenizer_config.json if tokenizer_class name ends with Fast
2. if not, check if the repo has a tokenizer.json file
    2.1. if yes, load PreTrainedTokenizerFast
3. if not, load a slow tokenizer
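For illustration, here is a rough sketch of that resolution order. This is not the actual implementation in tokenization_auto.py, and the helpers (`load_tokenizer_config`, `load_tokenizer_class`, `repo_has_file`, `load_slow_tokenizer`) are hypothetical stand-ins:

```python
from transformers import PreTrainedTokenizerFast


def resolve_tokenizer(repo_id, use_fast=True):
    # 1. (unchanged) tokenizer_config.json names a *Fast class -> use it
    config = load_tokenizer_config(repo_id)  # hypothetical helper, returns a dict
    tokenizer_class = config.get("tokenizer_class")
    if tokenizer_class is not None and tokenizer_class.endswith("Fast"):
        return load_tokenizer_class(tokenizer_class, repo_id)  # hypothetical helper

    # 2. (this PR) no fast class configured, but the repo ships a
    #    tokenizer.json -> load PreTrainedTokenizerFast directly
    if use_fast and repo_has_file(repo_id, "tokenizer.json"):  # hypothetical helper
        return PreTrainedTokenizerFast.from_pretrained(repo_id)

    # 3. otherwise fall back to the slow tokenizer, as before
    return load_slow_tokenizer(repo_id)  # hypothetical helper
```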
Prerequisite for https://github.com/huggingface/transformers/pull/29969
@ArthurZucker SigLip will be the first model able to test this (the first model where we infer PreTrainedTokenizerFast without it being specified in the tokenizer_config.json). I can add the test to the SigLip PR https://github.com/huggingface/transformers/pull/29969 to avoid creating an internal model, wdyt?
Edit: added the test to the SigLip PR here: https://github.com/huggingface/transformers/pull/29969/files#r1784784751
@ArthurZucker the test is here in the SigLip PR: https://github.com/huggingface/transformers/pull/29969/files#r1784784751 (copy-pasted below)
```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Model does not have a fast tokenizer or PreTrainedTokenizerFast specified
# in its config but can still load fast
tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
self.assertEqual(type(tokenizer), PreTrainedTokenizerFast)
```
No, this test cannot be added to this PR because Siglip is not merged yet; Siglip would be the first model this can be tested with. The test checks that when a fast tokenizer or PreTrainedTokenizerFast is not specified in the tokenizer config file, AutoTokenizer can still load PreTrainedTokenizerFast.
To run this test in this PR, we could make an internal siglip model, but I think that would be redundant since we intend to merge siglip soon.
Okay, I mean we could still add a model on the Hub that doesn't have a tokenizer_config.json and check that it still loads a fast pretrained tokenizer.
Okay, it's just that IMO what we should do sounds simple: if we are not able to go through the normal route but we have a tokenizer.json, just load PreTrainedTokenizerFast.
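For reference, loading a fast tokenizer directly from a serialized `tokenizer.json` is already supported via the `tokenizer_file` argument of PreTrainedTokenizerFast (the local path below is just an example):

```python
from transformers import PreTrainedTokenizerFast

# Build a fast tokenizer straight from a tokenizers-library file;
# the path is illustrative.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="path/to/tokenizer.json")
```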
@ArthurZucker this feature requires that PreTrainedTokenizerFast is an option for a given model in this specific dictionary:
https://github.com/huggingface/transformers/pull/29969/files#r1792386118
So if we create a private testing model on the Hub without the PreTrainedTokenizerFast class specified in the config, we would still have to modify tokenization_auto to add PreTrainedTokenizerFast as the value. Unless we change this feature to always try Fast, but then it would never fall back to slow (which we currently do) when loading fails, because the error would surface too deep in the stack.
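To make that concrete, an entry in that mapping pairs a model type with its (slow, fast) tokenizer class names; the sketch below approximates what the linked SigLip PR adds, so treat the exact shape as an assumption rather than the verbatim diff:

```python
from transformers.utils import is_tokenizers_available

# Approximate shape of the relevant entry in TOKENIZER_MAPPING_NAMES
# (tokenization_auto.py); the exact line may differ in the SigLip PR.
TOKENIZER_MAPPING_NAMES = {
    "siglip": (
        "SiglipTokenizer",  # slow tokenizer class
        "PreTrainedTokenizerFast" if is_tokenizers_available() else None,  # fast fallback
    ),
}
```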
Also, I no longer have rights to create models for the HF org on the Hub 😢
Ah okay let me add you on the org!
Added you back! 🤗
- checks tokenizer_config.json if tokenizer_class name ends with Fast
- if not, check if repo has a tokenizer.json file
  - if yes, load PreTrainedTokenizerFast
- if not, load a slow tokenizer
This is exactly what I want!
Unless we change this feature to always try Fast, but then it would never fall back to slow (which we currently do) when loading fails, because the error would surface too deep in the stack.
IMO let's just aim for simplicity: if we can't load anything (no tokenizer_config.json), then we just load the tokenizer.json. That's it, it's the only thing to add!
Let's merge siglip, this one will follow!