[Question] what to do when model doesn't have `tokenizer.model`?
While `tokenizer.model` is required in the YAML config, there are many models that don't have a `tokenizer.model` (example: unsloth/Llama-3.2-1B).
In these cases, how can we use the `tokenizer.json` or `tokenizer_config.json` files, which are included with almost all models, instead of `tokenizer.model`?
In your case specifically, you can use the original Llama 3.2 1B `tokenizer.model` from https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (if the unsloth version is based off the instruct model; otherwise use the base model's repo). If unsloth modified any of the special tokens, then you will need a new `tokenizer.model`.
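For reference, a minimal way to fetch just the `tokenizer.model` file is via `huggingface_hub`. The `original/tokenizer.model` path below is an assumption based on how the official meta-llama repos lay out their non-HF artifacts; adjust it if the repo structure differs:

```python
from huggingface_hub import hf_hub_download

# Download only the original tokenizer.model from the official repo.
# Note: gated repo -- requires an authenticated HF account that has
# accepted the Llama license.
tokenizer_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    filename="original/tokenizer.model",
)

# Point the `tokenizer.path` field of your torchtune YAML config at this file.
print(tokenizer_path)
```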
I don't believe you can load the tokenizer without the `tokenizer.model` file, because it contains the BPE encoding itself.
@RdoubleA Thanks for explaining, got it.
Here are some other models that don't have a `tokenizer.model`:
- deepseek-ai/DeepSeek-V3
- Qwen/QVQ
- nvidia/Llama-3.1-Nemotron
- openai/gpt2
- mistralai/Mistral-Nemo
- CohereForAI/c4ai
- facebook/opt-125m
I don't have any idea what should be done here.
@joecummings @RdoubleA I faced this while working on the Phi4 PR. There are several possible solutions, but I would love to get comments from you first.
So if I understand correctly, this is basically a function of torchtune not integrating with the Hugging Face tokenizers library, correct? In most of the examples listed above, I believe there are `tokenizer.json` and `tokenizer_config.json` files that are used by HF to build the tokenizer. I think we could consider building a utility that parses a given HF tokenizer and wraps it into a format that is compatible with torchtune. This would require a fair bit of discussion though, as there are a lot of details we'd need to iron out. cc @joecummings @RdoubleA for your thoughts
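To make that concrete, a wrapper could load the `tokenizer.json` with the Hugging Face `tokenizers` library and expose an encode/decode interface similar to torchtune's model tokenizers. This is only a sketch: the class name, constructor arguments, and BOS/EOS handling are assumptions, not an existing torchtune API:

```python
from typing import List, Optional

from tokenizers import Tokenizer  # Hugging Face `tokenizers` library


class HFTokenizerWrapper:
    """Hypothetical adapter that wraps an HF tokenizer.json behind an
    encode/decode interface resembling torchtune's model tokenizers."""

    def __init__(
        self,
        tokenizer_json_path: str,
        bos_id: Optional[int] = None,
        eos_id: Optional[int] = None,
    ):
        self._tok = Tokenizer.from_file(tokenizer_json_path)
        self.bos_id = bos_id
        self.eos_id = eos_id

    def encode(self, text: str, add_bos: bool = True, add_eos: bool = True) -> List[int]:
        # Let the wrapper control special tokens rather than the HF template.
        ids = self._tok.encode(text, add_special_tokens=False).ids
        if add_bos and self.bos_id is not None:
            ids = [self.bos_id] + ids
        if add_eos and self.eos_id is not None:
            ids = ids + [self.eos_id]
        return ids

    def decode(self, ids: List[int]) -> str:
        return self._tok.decode(ids)
```

The harder part, and where the discussion is needed, is everything beyond plain encode/decode: chat templates, special-token handling, and the message-tokenization logic that torchtune's model tokenizers also carry.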
@krammnic Took a look at your PR. I agree we need a better solution here. We are working on integrating with HF better so it's easier to port over new models, with tokenizers being a major pain point. A few options:
- We build a converter that takes in the `tokenizer_config.json` from HF and queries the `tokenizer_class`. For a small subset of very common classes, we map to the analogue in torchtune and load a default `tokenizer.model`. For Phi4, it would be `GPT2Tokenizer` (we don't have an analogue for this; it could be TikToken, but not sure) (see https://huggingface.co/microsoft/phi-4/blob/main/tokenizer_config.json#L779). A rough sketch of this option is shown after this list.
- We build a converter that takes the entire mapping in `tokenizer.json` from HF and builds the tokenizer from scratch. I'm not sure what abstractions are needed to support this, but it would remove the need to keep adding supported HF tokenizer classes.
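For the first option, the converter could be as simple as a lookup keyed on `tokenizer_class`. The registry entries and function name below are illustrative assumptions, not actual torchtune mappings:

```python
import json

# Hypothetical registry from HF `tokenizer_class` names to torchtune
# tokenizer builders; these entries are illustrative, not real mappings.
_HF_TO_TORCHTUNE = {
    "LlamaTokenizer": "torchtune.models.llama2.llama2_tokenizer",
    "PreTrainedTokenizerFast": "torchtune.models.llama3.llama3_tokenizer",
    # "GPT2Tokenizer": ?  # no torchtune analogue yet -- the Phi4 case
}


def resolve_tokenizer_builder(tokenizer_config_path: str) -> str:
    """Read `tokenizer_class` from an HF tokenizer_config.json and map it
    to a torchtune tokenizer builder, failing loudly when there is none."""
    with open(tokenizer_config_path) as f:
        config = json.load(f)
    hf_class = config.get("tokenizer_class")
    if hf_class not in _HF_TO_TORCHTUNE:
        raise ValueError(
            f"No torchtune analogue registered for HF tokenizer class {hf_class!r}"
        )
    return _HF_TO_TORCHTUNE[hf_class]
```

The obvious downside is the one noted above: every new `tokenizer_class` needs a new registry entry, which is exactly what the second option would avoid.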
The other thing to consider is that once a new model tokenizer is added, we don't need to "convert" from HF anymore, because users can just instantiate the added model tokenizer. Or maybe we'll just need to load from some base `tokenizer.model` each time.
Open to other solutions.