mlx-examples
About the consistency of using `tokenizer.model` instead of `AutoTokenizer` with `use_fast=False`
Hi here 🤗 After playing around a bit with MLX and some of the examples under mlx-examples, I was wondering whether using `tokenizer.model` (the SentencePiece tokenizer model file) instead of loading the tokenizer via `AutoTokenizer` with `use_fast=False` is actually better. AFAIK, since some examples rely on that single file only, configuration details may be missing, e.g. special tokens that were added on top of the default tokenizer. Maybe that's not really relevant for showcasing the examples, and I guess having `tokenizer.model` in the same Hugging Face Hub repository is better in terms of sharing MLX models, but I was wondering whether there's any downside to this approach that could lead to conflicts during generation. Thanks in advance!
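For context, here's a rough sketch of the two loading paths I mean; the local directory is just a placeholder:

```python
from sentencepiece import SentencePieceProcessor
from transformers import AutoTokenizer

model_dir = "path/to/converted-model"  # placeholder local directory

# Option A: the raw SentencePiece model file, as some examples use.
# This only carries what is serialized inside tokenizer.model, so any
# special tokens added on top of it later will not be present.
sp = SentencePieceProcessor(model_file=f"{model_dir}/tokenizer.model")
print(sp.encode("Hello world"))

# Option B: the full Hugging Face tokenizer, which also reads
# tokenizer_config.json / special_tokens_map.json / added_tokens.json
# and therefore keeps any extra special tokens.
hf_tok = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
print(hf_tok.encode("Hello world"))
```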
It would seem to be an issue for me, but I'll still consider this serendipity 😂
I was getting `RuntimeError: Internal: src/sentencepiece_processor.cc(1103) [model_proto->ParseFromArray(serialized.data(), serialized.size())]` for my DeepSeek-Coder-33B model off of HF. I just looked up what the issue was related to, and fortunately there was a GH issue thread detailing exactly that.
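In case it helps anyone else, here's a rough sketch of the kind of guarded load I mean; that error is what SentencePiece raises when the file it's handed isn't actually a serialized SentencePiece model, so the sketch just falls back to the full Hugging Face tokenizer (the path is a placeholder):

```python
from pathlib import Path

from sentencepiece import SentencePieceProcessor
from transformers import AutoTokenizer

model_dir = Path("path/to/deepseek-coder-33b-instruct")  # placeholder

try:
    # Fails with the ParseFromArray error if tokenizer.model is missing
    # or is not a SentencePiece protobuf.
    tokenizer = SentencePieceProcessor(model_file=str(model_dir / "tokenizer.model"))
except (OSError, RuntimeError):
    # Fall back to whatever tokenizer files the repo actually ships.
    tokenizer = AutoTokenizer.from_pretrained(str(model_dir))
```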
I guess this means I can just modify that piece of the puzzle in the Llama file and it'll work? Will report back!
Sigh, well, I give up. That was working to a decent degree; I just had some trouble first with the tokenizer file, which kept trying to use the HF remote repo data instead of the file path I gave it. I got that squared away, I believe by asserting `from_pretrained=True` (I had `use_fast=False` in the actual loader call for my tokenizer path).
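For reference, a sketch of roughly where I ended up; the path is a placeholder, and `local_files_only` is my guess at the setting that stops it reaching for the remote repo:

```python
from transformers import AutoTokenizer

tokenizer_path = "/path/to/local/deepseek-coder-33b-instruct"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_path,
    use_fast=False,         # as in my loader call
    local_files_only=True,  # don't fall back to the Hugging Face Hub
)
```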
But once that got squared away, I had issues with the `config.json` afterwards. I think that's largely due to it deviating from a typical Llama `config.json`(?)
https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/config.json
This is what my janky llama.py file ended up looking like. Of course, I tried a ton of combinations, but I don't know the ML technical details, so it was futile. I had no idea which config values correlated to which.
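For anyone else poking at this, here's a quick sketch of how one could check which `config.json` keys line up with what the example expects, assuming the example's llama.py defines a `ModelArgs` dataclass (the path below is a placeholder):

```python
import json
from dataclasses import fields

from llama import ModelArgs  # assumes this is run from the example directory

with open("path/to/deepseek-coder-33b-instruct/config.json") as fh:
    hf_config = json.load(fh)

expected = {f.name for f in fields(ModelArgs)}  # fields the example's model wants
provided = set(hf_config)                       # keys the HF config.json actually has

print("expected but missing:", sorted(expected - provided))
print("provided but unused: ", sorted(provided - expected))
```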
I recommend using the hf_llm example; it uses `AutoTokenizer` and should handle tokenization more cleanly in general. We are moving other examples towards using `AutoTokenizer` as well.
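For anyone landing here, that route looks roughly like this (the model id is just an illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-33b-instruct")

ids = tokenizer.encode("def fib(n):")
print(tokenizer.decode(ids))
```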