
[Question] what to do when model doesn't have `tokenizer.model`?

Open steveepreston opened this issue 11 months ago • 2 comments

tokenizer.model is required in the yaml config, but there are many models that don't have a tokenizer.model (example: unsloth/Llama-3.2-1B).

In these cases, how can we use the tokenizer.json or tokenizer_config.json files, which are included in almost all models, instead of tokenizer.model?

steveepreston avatar Dec 29 '24 18:12 steveepreston

In your case specifically, you can use the original Llama 3.2 1B tokenizer.model from https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (if the unsloth version is based on the instruct model; otherwise, use the base model's repo). If unsloth modified any of the special tokens, then you will need a new tokenizer.model.

I don't believe you can load in the tokenizer without the tokenizer.model file, because it contains the BPE encoding itself.
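
For reference, a minimal sketch of that workflow (assumptions: the gated meta-llama repo stores the file under original/, and the builder and encode signatures match recent torchtune releases):

```python
# Download the original Meta tokenizer.model and point torchtune's Llama 3
# tokenizer builder at it. Requires accepting the Llama license on the Hub.
from huggingface_hub import hf_hub_download
from torchtune.models.llama3 import llama3_tokenizer

tokenizer_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    filename="original/tokenizer.model",
)
tokenizer = llama3_tokenizer(path=tokenizer_path)
print(tokenizer.encode("hello torchtune", add_bos=True, add_eos=True))
```

The same local path can then be set as the tokenizer `path` field in the yaml config.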

RdoubleA avatar Jan 01 '25 01:01 RdoubleA

@RdoubleA Thanks for explaining, that makes sense for this case. Here are some other models that don't have a tokenizer.model:

  • deepseek-ai/DeepSeek-V3
  • Qwen/QVQ
  • nvidia/Llama-3.1-Nemotron
  • openai/gpt2
  • mistralai/Mistral-Nemo
  • CohereForAI/c4ai
  • facebook/opt-125m

I'm not sure what should be done here.

steveepreston avatar Jan 01 '25 05:01 steveepreston

@joecummings @RdoubleA I ran into this while working on the Phi4 PR. There are several possible solutions, but I'd love to get your comments first.

krammnic avatar Jan 12 '25 23:01 krammnic

So if I understand correctly, this is basically a consequence of torchtune not integrating with the Hugging Face tokenizers library, correct? In most of the examples listed above, I believe there are tokenizer.json and tokenizer_config.json files that HF uses to build the tokenizer. We could consider building a utility that parses a given HF tokenizer and wraps it into a format compatible with torchtune. This would require a fair bit of discussion though, as there are a lot of details we'd need to iron out. cc @joecummings @RdoubleA for your thoughts
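
As a strawman, here's the kind of wrapper I mean (the class name and its encode/decode surface are made up for illustration; nothing like this exists in torchtune today):

```python
# Load a Hugging Face tokenizer.json with the `tokenizers` library and wrap
# it behind the encode/decode interface torchtune tokenizers expose.
from tokenizers import Tokenizer


class HFTokenizerWrapper:
    def __init__(self, tokenizer_json_path: str, bos_id: int, eos_id: int):
        self._tok = Tokenizer.from_file(tokenizer_json_path)
        self.bos_id = bos_id
        self.eos_id = eos_id

    def encode(self, text: str, add_bos: bool = True, add_eos: bool = True) -> list[int]:
        # Control special tokens here instead of via HF's post-processor.
        ids = self._tok.encode(text, add_special_tokens=False).ids
        if add_bos:
            ids = [self.bos_id] + ids
        if add_eos:
            ids = ids + [self.eos_id]
        return ids

    def decode(self, ids: list[int]) -> str:
        return self._tok.decode(ids, skip_special_tokens=True)
```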

ebsmothers avatar Jan 13 '25 18:01 ebsmothers

@krammnic Took a look at your PR. I agree we need a better solution here. We are working on integrating with HF better so it's easier to port over new models, tokenizers being a major pain point. A few options:

  • We build a converter that takes in the tokenizer_config.json from HF and queries the tokenizer_class. For a small subset of very common classes, we map to the analogue in torchtune and load a default tokenizer.model. For Phi4 it would be GPT2Tokenizer (we don't have an analogue for this; it could be TikToken, but I'm not sure) (see https://huggingface.co/microsoft/phi-4/blob/main/tokenizer_config.json#L779). A rough sketch of this dispatch is shown after the list.
  • We build a converter that takes the entire mapping in tokenizer.json from HF and builds the tokenizer from scratch. I'm not sure what abstractions are needed to support this, but it would remove the need to keep adding supported HF tokenizer classes.
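
A rough sketch of the first option (the mapping entries, including the Phi4 builder path, are assumptions for illustration rather than existing torchtune components):

```python
# Read `tokenizer_class` from a downloaded tokenizer_config.json and dispatch
# to a registered torchtune analogue.
import json

# Hypothetical mapping from HF tokenizer classes to torchtune builders.
TOKENIZER_CLASS_TO_COMPONENT = {
    "LlamaTokenizer": "torchtune.models.llama2.llama2_tokenizer",
    "GPT2Tokenizer": "torchtune.models.phi4.phi4_tokenizer",  # assumed builder
}


def resolve_tokenizer_component(tokenizer_config_path: str) -> str:
    with open(tokenizer_config_path) as f:
        tokenizer_class = json.load(f)["tokenizer_class"]
    try:
        return TOKENIZER_CLASS_TO_COMPONENT[tokenizer_class]
    except KeyError:
        raise ValueError(
            f"No torchtune analogue registered for HF class {tokenizer_class!r}"
        ) from None
```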

The other thing to consider is that once a new model tokenizer is added, we don't need to "convert" from HF anymore, because users can just instantiate the added model tokenizer. Or maybe we'll just need to load from some base tokenizer.model each time.

Open to other solutions.

RdoubleA avatar Jan 14 '25 18:01 RdoubleA