Support reading tiktoken tokenizer.model file
Use the existing `TikTokenConverter` to convert a tiktoken `tokenizer.model` file.
Sample usage:

```python
model_file_name = 'tokenizer.model'
tokenizer = AutoTokenizer.from_pretrained('hf-internal-testing/Llama3-Instruct-Internal', tiktoken_file=model_file_name, from_slow=True)
```
- [x] add case to convert_tiktoken_tokenizer
- [x] add internal model
- [x] add test
Workflow changes
- `tokenization_utils_base.py`: when loading a model, the slow tokenizer is loaded first. If the `tokenizer.model` file is not SPM, then an error of type `google.protobuf.message.DecodeError` is thrown, or a `RuntimeError` on loading `ModelProto`. So, the first step is to catch these SPM-related errors and set `tokenizer = False` to indicate failure.
- `tokenization_utils_fast.py`: check if `slow_tokenizer` is `False`; if so, try to convert from tiktoken.
- `convert_slow_tokenizer.py`: use `TikTokenConverter` to convert.
Note: the reason we catch errors is because there is no way to differentiate the tokenizer.model file as SPM or TikToken with the current standards for hub files. So, we always try to convert from SPM, if we fail, we try with TikToken.
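The try-SPM-then-fall-back-to-tiktoken flow can be sketched as follows. This is an illustrative standalone sketch, not the actual transformers internals: `load_spm`, `convert_tiktoken`, and the local `DecodeError` class are stand-ins for the real protobuf/SentencePiece loading path and the `TikTokenConverter` path.

```python
class DecodeError(Exception):
    """Stand-in for google.protobuf.message.DecodeError."""

def load_spm(model_file):
    # Stand-in: pretend any file without an .spm suffix fails to parse,
    # the way protobuf fails on a non-SPM tokenizer.model.
    if not model_file.endswith(".spm"):
        raise DecodeError("not a SentencePiece model")
    return f"spm:{model_file}"

def convert_tiktoken(model_file):
    # Stand-in for the TikTokenConverter path in convert_slow_tokenizer.py.
    return f"tiktoken:{model_file}"

def load_tokenizer(model_file):
    """Try SPM first; on SPM-specific errors, fall back to tiktoken conversion."""
    try:
        return load_spm(model_file)
    except (DecodeError, RuntimeError):
        # tokenization_utils_base.py sets the slow tokenizer to False on failure.
        slow_tokenizer = False
    if slow_tokenizer is False:
        # tokenization_utils_fast.py checks the flag and converts from tiktoken.
        return convert_tiktoken(model_file)
```

A non-SPM `tokenizer.model` takes the tiktoken branch, while a genuine SPM file is loaded directly; since hub files carry no marker distinguishing the two formats, the error-driven fallback is the only available signal.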
Rebased but still getting a weird test failure:

```
FAILED tests/models/auto/test_processor_auto.py::ProcessorPushToHubTester::test_push_to_hub_dynamic_processor - huggingface_hub.utils._errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://hub-ci.huggingface.co/api/repos/create (Request ID: Root=1-669e4d63-6b0d910d32c3a5ff1c4ce2d3;da66a4cb-6295-419e-9142-861d4b428fb2)
```
@ArthurZucker