
Support reading tiktoken tokenizer.model file

Open · itazap opened this issue 1 year ago · 1 comment

Use the existing TikTokenConverter to convert a tiktoken tokenizer.model file.

Sample Usage:

```python
from transformers import AutoTokenizer

model_file_name = "tokenizer.model"

tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/Llama3-Instruct-Internal",
    tiktoken_file=model_file_name,
    from_slow=True,
)
```
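
For context, a tiktoken-format tokenizer.model is a plain-text file of base64-encoded token bytes and their integer ranks, not a SentencePiece protobuf, which is why the existing SPM loading path fails on it. A minimal sketch of inspecting such a file directly, assuming the tiktoken package is installed:

```python
# Sketch: read a tiktoken-format tokenizer.model directly.
# load_tiktoken_bpe returns a dict mapping token bytes -> rank (token id).
from tiktoken.load import load_tiktoken_bpe

mergeable_ranks = load_tiktoken_bpe("tokenizer.model")
print(len(mergeable_ranks))                # vocabulary size
print(list(mergeable_ranks.items())[:3])   # first few (bytes, rank) pairs
```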

  • [x] add case to convert_tiktoken_tokenizer
  • [x] add internal model
  • [x] add test

Workflow changes

  1. tokenization_utils_base.py: when loading a model, the slow tokenizer is loaded first. If the tokenizer.model file is not SPM, a google.protobuf.message.DecodeError (or a RuntimeError while loading the ModelProto) is raised. So the first step is to catch these SPM-related errors and set tokenizer = False to indicate failure.
  2. tokenization_utils_fast.py: check whether slow_tokenizer is False; if so, try to convert from tiktoken.
  3. convert_slow_tokenizer.py: use TikTokenConverter to perform the conversion.
  • Note: the reason we catch errors is that, with the current standards for Hub files, there is no way to tell whether a tokenizer.model file is SPM or TikToken. So we always try to convert from SPM first and, if that fails, fall back to TikToken (see the sketch after this list).
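
For illustration, a minimal sketch of this try-SPM-then-fall-back-to-TikToken flow; the function shape and the TikTokenConverter call here are assumptions for readability, not the exact transformers internals:

```python
# Sketch of the three workflow steps above (names/signatures assumed).
import sentencepiece as spm
from google.protobuf.message import DecodeError


def load_fast_tokenizer(model_file: str):
    # Step 1 (tokenization_utils_base.py): try to load the file as SPM.
    slow_tokenizer = spm.SentencePieceProcessor()
    try:
        slow_tokenizer.Load(model_file)  # non-SPM files raise RuntimeError
    except (DecodeError, RuntimeError):
        slow_tokenizer = False  # signal that the slow (SPM) load failed

    # Step 2 (tokenization_utils_fast.py): on failure, fall back to tiktoken.
    if slow_tokenizer is False:
        # Step 3 (convert_slow_tokenizer.py): convert via TikTokenConverter.
        from transformers.convert_slow_tokenizer import TikTokenConverter

        return TikTokenConverter(model_file).converted()  # args assumed
    return slow_tokenizer
```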

  • Rebased, but still getting a weird test failure: FAILED tests/models/auto/test_processor_auto.py::ProcessorPushToHubTester::test_push_to_hub_dynamic_processor - huggingface_hub.utils._errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://hub-ci.huggingface.co/api/repos/create (Request ID: Root=1-669e4d63-6b0d910d32c3a5ff1c4ce2d3;da66a4cb-6295-419e-9142-861d4b428fb2)

@ArthurZucker

itazap · Jun 27 '24 12:06

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.