Example reading directly from a GGUF file
This loads all weights, config, and vocab directly from a GGUF file, using https://github.com/ml-explore/mlx/pull/350.
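For reference, the weights come straight out of mx.load on the GGUF file, which returns a mapping of tensor names to arrays. A minimal sketch (the path is just a placeholder for a local GGUF file):

import mlx.core as mx

# Load all tensors from a GGUF file into a dict of name -> mx.array.
weights = mx.load("models/tiny_llama/model.gguf")
print(f"Loaded {len(weights)} tensors")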
Example run:
$ python llama.py models/tiny_llama/model.gguf
[INFO] Loading model from models/tiny_llama/model.gguf.
Press enter to start generation
------
In the beginning the Universe was created.
Verse 2:
The Universe was created with a great big bang.
Chorus:
The Universe is a vast and endless space.
Verse 3:
The Universe is full of mysteries to explore.
Chorus:
The Universe is a vast and endless space.
Bridge:
The Universe is a vast and endless space.
Chorus:
The Universe is
------
[INFO] Prompt processing: 0.202 s
[INFO] Full generation: 3.413 s
Hey hey! @jbochi - Very cool! IMO it'd be great if people could download the weights directly from the HF Hub. The weights could potentially reside in the MLX-Community org as well.
@jbochi I pushed a substantial change here. I moved the example to be almost the same as hf_llm for the sake of consistency and to keep the option open for future mergeability. Make sure you pull before you make any more changes.
Hi great work @jbochi! I'm trying to get this to work by pulling your forked repo and running it. I'm able to download the model but it's not working after this step for me.
(base) karthikkannan@M1-MBP gguf_llm % python generate.py \
--repo TheBloke/Mistral-7B-v0.1-GGUF \
--gguf mistral-7b-v0.1.Q6_K.gguf \
--prompt "hello"
mistral-7b-v0.1.Q6_K.gguf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.94G/5.94G [04:40<00:00, 21.2MB/s]
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [04:41<00:00, 281.71s/it]
[INFO] Loading model from /Users/karthikkannan/.cache/huggingface/hub/models--TheBloke--Mistral-7B-v0.1-GGUF/snapshots/d4ae605152c8de0d6570cf624c083fa57dd0d551/mistral-7b-v0.1.Q6_K.gguf
Traceback (most recent call last):
File "/Users/karthikkannan/Work/gguf_mlx-examples/mlx-examples/llms/gguf_llm/generate.py", line 85, in <module>
model, tokenizer = models.load(args.gguf, args.repo)
File "/Users/karthikkannan/Work/gguf_mlx-examples/mlx-examples/llms/gguf_llm/models.py", line 333, in load
weights = mx.load(gguf_file)
RuntimeError: [load] Invalid header in file /Users/karthikkannan/.cache/huggingface/hub/models--TheBloke--Mistral-7B-v0.1-GGUF/snapshots/d4ae605152c8de0d6570cf624c083fa57dd0d551/mistral-7b-v0.1.Q6_K.gguf
I'm using python 3.10.0 and running Ventura. My laptop's an M1 Mac with 32 GB RAM. Can run llama.cpp just fine.
@Shoshin23 Thanks for reporting this. I suspect you are still using an older version of mlx because I can't find this error message in the source.
Now that https://github.com/ml-explore/mlx/pull/350 has been merged, you can build mlx directly from main. Can you please run pip install . at HEAD, and then run python -c 'import mlx.core as mx; print(mx.__version__)'?
You should see something like Successfully installed mlx-0.0.7.dev2024111+c92a134 after pip install and 0.0.7.dev2024111+c92a134 for the second command.
Yay! Got it to work after your comment. Exciting stuff!
@awni , do you still plan to make any changes to this? Or should we wait for https://github.com/ml-explore/mlx/pull/426 ?
@jbochi I don't think we need to wait until https://github.com/ml-explore/mlx/pull/426. I will double check this and we can merge it today!
Awesome! Thank you
@awni, I can't find gguf related python code in the current examples directory. I did find it here: https://github.com/ml-explore/mlx-examples/tree/9804e1155c961c864971a5db750030ebcc3db8f2/llms/gguf_llm
But models.py has pdb trace code in it, and it reports this error:
File "gguf_llm/models.py", line 190, in get_string_array_field
return [bytes(f.parts[d]).decode("utf-8") for d in f.data]
AttributeError: 'NoneType' object has no attribute 'data'
What model are you using, @gladjoyhub? Looks like it doesn't have tokenizer.ggml.tokens or tokenizer.ggml.merges, so the tokenizer cannot be initialized. I only tested this code with TinyLlama.
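For debugging, a guard along these lines would turn that AttributeError into a clearer message when the field is missing. This is only a rough sketch based on the snippet above; the metadata lookup and function signature are illustrative, not the exact code in models.py:

def get_string_array_field(metadata, key):
    # Sketch: fetch a GGUF metadata field and fail loudly if the file
    # doesn't carry it (e.g. no tokenizer.ggml.tokens in the model file).
    field = metadata.get(key)
    if field is None:
        raise ValueError(f"GGUF file is missing metadata field '{key}'")
    return [bytes(field.parts[d]).decode("utf-8") for d in field.data]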
@jbochi TheBloke/SOLAR-10.7B-Instruct-v1.0-GGUF the q8 file
Also, it would be nice if lora.py could load a GGUF as the base model too.
The relevant PR is #222 from @jbochi. So far I've tested it with a TinyLlama and a Mistral model from TheBloke and it worked, but indeed I do not think it will support arbitrary GGUF files yet.
Would love some pointers on writing out GGUF. I am trying to convert an MLX LoRA fine-tuned model to GGUF for use in Ollama. Is there an example somewhere of writing to GGUF?
There is a save_gguf function that you can call. You just need to pass the file path and a map of arrays. However, it doesn't support quantized weights or metadata yet.
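Roughly like this (a minimal sketch assuming the save_gguf API described above, i.e. a file path plus a name-to-array map; no quantized weights or metadata):

import mlx.core as mx

# Sketch: write a map of arrays out as a GGUF file. The array contents
# here are placeholders; pass your model's weights instead.
arrays = {"weight": mx.zeros((4, 4)), "bias": mx.zeros((4,))}
mx.save_gguf("exported_model.gguf", arrays)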
@jbochi this is working now for Mistral and TinyLlama with native quantization. Let's merge it after we merge https://github.com/ml-explore/mlx/pull/426
Amazing! Thanks for making it happen!
Thank YOU for making it happen!