
About the consistency of using `tokenizer.model` instead of `AutoTokenizer` with `use_fast=False`

Open alvarobartt opened this issue 1 year ago • 2 comments

Hi here 🤗 After playing around a bit with `mlx` and some of the examples under `mlx-examples`, I was wondering whether using `tokenizer.model` (i.e. the SentencePiece tokenizer model file) is really preferable to loading the tokenizer via `AutoTokenizer` with `use_fast=False`. AFAIK, since some examples rely on that single file only, some configuration details may be missing, e.g. `special_tokens` that were added on top of the default tokenizer.

Maybe that's not really relevant for showcasing the examples, and in terms of sharing MLX models it's arguably better to have the `tokenizer.model` within the same repository on the Hugging Face Hub, but I was wondering whether there's any downside to this approach that may lead to conflicts with the generation process itself. Thanks in advance!
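To make the comparison concrete, here is a minimal sketch of the two loading paths (assuming a local checkpoint directory that ships both `tokenizer.model` and the usual HF tokenizer config files; the path below is just a placeholder):

```python
from sentencepiece import SentencePieceProcessor
from transformers import AutoTokenizer

checkpoint = "path/to/converted-model"  # hypothetical local directory

# Path 1: raw SentencePiece model file only (what some examples rely on).
sp = SentencePieceProcessor(model_file=f"{checkpoint}/tokenizer.model")
print(sp.vocab_size())  # base vocabulary only

# Path 2: AutoTokenizer with the slow (SentencePiece-backed) implementation;
# this also reads tokenizer_config.json / added_tokens.json, so special
# tokens added on top of the base tokenizer are preserved.
hf_tok = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
print(len(hf_tok))  # may be larger if special tokens were added
```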

alvarobartt avatar Dec 21 '23 09:12 alvarobartt

It would seem to be an issue for me, but I'll still consider this serendipity 😂

I was getting `RuntimeError: Internal: src/sentencepiece_processor.cc(1103) [model_proto->ParseFromArray(serialized.data(), serialized.size())]` for my Deepseek-Coder-33b model off of HF. I just looked up what the issue was related to, and fortunately there was a GH issue thread detailing just that.

I guess this means I can just modify that piece of the puzzle in the Llama file and it'll work? Will report back!
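For context, a modification along those lines could look roughly like the following: wrapping `AutoTokenizer` so the rest of the example keeps its encode/decode interface. The method and property names here are assumptions, not the example's actual API.

```python
from transformers import AutoTokenizer

class HFTokenizerWrapper:
    """Thin wrapper so code written against a SentencePiece-based
    tokenizer can keep calling encode()/decode()."""

    def __init__(self, model_path: str):
        # use_fast=False keeps the slow (SentencePiece-backed) tokenizer
        # where one exists; AutoTokenizer still reads added special tokens.
        self._tok = AutoTokenizer.from_pretrained(model_path, use_fast=False)

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text)

    def decode(self, token_ids: list[int]) -> str:
        return self._tok.decode(token_ids)

    @property
    def eos_id(self) -> int:
        return self._tok.eos_token_id
```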

BuildBackBuehler avatar Dec 21 '23 09:12 BuildBackBuehler

Sigh, well, I give up. That was working to a good degree; I just had some trouble first with the tokenizer file, which kept trying to use the HF remote repo data in lieu of the file path given. Then I got that squared away, I believe by using `assert from_pretrained=True` (and I had `use_fast=False` in the actual loader for my tokenizer path).
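In case it helps anyone else, forcing transformers to stay on the local path can look roughly like this (the directory is a placeholder, and `use_fast=False` just mirrors my setup above; not every model actually ships a slow tokenizer):

```python
from transformers import AutoTokenizer

tokenizer_path = "/path/to/deepseek-coder-33b-instruct"  # local checkpoint dir

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_path,
    use_fast=False,         # slow implementation, where one exists
    local_files_only=True,  # never fall back to the remote Hub repo
)
```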

But once that was squared away, I had issues with the `config.json` afterwards. I think that's largely due to it deviating from a typical Llama `config.json`(?)

https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/config.json

This is what my janky `llama.py` file ended up looking like. Of course, I tried a ton of combinations, but I don't know the ML technical details, so it was futile. I had no idea which config values correlated.

llama copy.txt
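For reference, translating a Hugging Face-style `config.json` (like the deepseek-coder one linked above) into llama-example-style arguments might look roughly like this; the target field names are guesses at the example's `ModelArgs`, not its actual definition:

```python
import json

def hf_config_to_llama_args(config_path: str) -> dict:
    """Map common HF llama-architecture config keys to example-style names."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "dim": cfg["hidden_size"],
        "n_layers": cfg["num_hidden_layers"],
        "n_heads": cfg["num_attention_heads"],
        "n_kv_heads": cfg.get("num_key_value_heads", cfg["num_attention_heads"]),
        "hidden_dim": cfg["intermediate_size"],
        "norm_eps": cfg["rms_norm_eps"],
        "vocab_size": cfg["vocab_size"],
        "rope_theta": cfg.get("rope_theta", 10000.0),
    }
```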

BuildBackBuehler avatar Dec 21 '23 11:12 BuildBackBuehler

I recommend using the `hf_llm` example; it uses `AutoTokenizer` and should manage tokenization more cleanly in general. We are moving the other examples towards `AutoTokenizer` as well.
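As a rough illustration of what that buys you: `AutoTokenizer` picks the right backend (`tokenizer.json` or `tokenizer.model`) and applies any added special tokens automatically. Something like the following, using the model id linked above purely as an example:

```python
from transformers import AutoTokenizer

# Model id from the config.json linked earlier, used only as an illustration.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-33b-instruct")

ids = tokenizer.encode("def fib(n):")
print(ids)
print(tokenizer.decode(ids))
```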

awni avatar Jan 10 '24 15:01 awni