Example reading directly from a GGUF file

jbochi opened this issue 1 year ago

This loads all weights, config, and vocab directly from a GGUF file using https://github.com/ml-explore/mlx/pull/350
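
In case it's useful, here is what the loading step boils down to (just a sketch; the full logic lives in llama.py / models.py):

import mlx.core as mx

# mx.load understands the GGUF format directly (ml-explore/mlx#350), so the
# weights come back as a plain {tensor_name: mx.array} dictionary; the config
# and vocab are read out of the same file by the example code.
weights = mx.load("models/tiny_llama/model.gguf")
print(f"loaded {len(weights)} tensors, e.g. {next(iter(weights))}")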

Example run:

$ python llama.py models/tiny_llama/model.gguf
[INFO] Loading model from models/tiny_llama/model.gguf.
Press enter to start generation
------
In the beginning the Universe was created.


Verse 2:
The Universe was created with a great big bang.

Chorus:
The Universe is a vast and endless space.

Verse 3:
The Universe is full of mysteries to explore.

Chorus:
The Universe is a vast and endless space.

Bridge:
The Universe is a vast and endless space.

Chorus:
The Universe is
------
[INFO] Prompt processing: 0.202 s
[INFO] Full generation: 3.413 s

jbochi avatar Jan 04 '24 14:01 jbochi

Hey hey! @jbochi - Very cool! IMO it would be great if people could download the weights directly from the HF Hub. The weights could potentially live in the MLX-Community org as well.
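
Something like this, roughly (hf_hub_download is from the huggingface_hub package; the repo and filename below are only placeholders):

from huggingface_hub import hf_hub_download
import mlx.core as mx

# Fetch a single GGUF file from the Hub into the local cache,
# then hand the cached path straight to MLX.
gguf_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q6_K.gguf",
)
weights = mx.load(gguf_path)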

Vaibhavs10 avatar Jan 04 '24 15:01 Vaibhavs10

@jbochi I pushed a substantial change here. I moved the example to be almost the same as hf_llm, for consistency and to keep the option of merging them later. Make sure you pull before you make any more changes.

awni avatar Jan 10 '24 00:01 awni

Hi great work @jbochi! I'm trying to get this to work by pulling your forked repo and running it. I'm able to download the model but it's not working after this step for me.

(base) karthikkannan@M1-MBP gguf_llm % python generate.py \
  --repo TheBloke/Mistral-7B-v0.1-GGUF \
  --gguf mistral-7b-v0.1.Q6_K.gguf \
  --prompt "hello"
mistral-7b-v0.1.Q6_K.gguf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.94G/5.94G [04:40<00:00, 21.2MB/s]
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [04:41<00:00, 281.71s/it]
[INFO] Loading model from /Users/karthikkannan/.cache/huggingface/hub/models--TheBloke--Mistral-7B-v0.1-GGUF/snapshots/d4ae605152c8de0d6570cf624c083fa57dd0d551/mistral-7b-v0.1.Q6_K.gguf
Traceback (most recent call last):
  File "/Users/karthikkannan/Work/gguf_mlx-examples/mlx-examples/llms/gguf_llm/generate.py", line 85, in <module>
    model, tokenizer = models.load(args.gguf, args.repo)
  File "/Users/karthikkannan/Work/gguf_mlx-examples/mlx-examples/llms/gguf_llm/models.py", line 333, in load
    weights = mx.load(gguf_file)
RuntimeError: [load] Invalid header in file /Users/karthikkannan/.cache/huggingface/hub/models--TheBloke--Mistral-7B-v0.1-GGUF/snapshots/d4ae605152c8de0d6570cf624c083fa57dd0d551/mistral-7b-v0.1.Q6_K.gguf

I'm using python 3.10.0 and running Ventura. My laptop's an M1 Mac with 32 GB RAM. Can run llama.cpp just fine.

Shoshin23 avatar Jan 11 '24 11:01 Shoshin23

@Shoshin23 Thanks for reporting this. I suspect you are still using an older version of mlx because I can't find this error message in the source.

Now that https://github.com/ml-explore/mlx/pull/350 has been merged, you can build mlx directly from main. Can you please run pip install . at the head of main, and then run python -c 'import mlx.core as mx; print(mx.__version__)'?

You should see something like Successfully installed mlx-0.0.7.dev2024111+c92a134 after pip install and 0.0.7.dev2024111+c92a134 for the second command.

jbochi avatar Jan 11 '24 11:01 jbochi

Yay! Got it to work after your comment. Exciting stuff!

Shoshin23 avatar Jan 11 '24 12:01 Shoshin23

@awni , do you still plan to make any changes to this? Or should we wait for https://github.com/ml-explore/mlx/pull/426 ?

jbochi avatar Jan 11 '24 20:01 jbochi

@jbochi I don't think we need to wait for https://github.com/ml-explore/mlx/pull/426. I will double check this and we can merge it today!

awni avatar Jan 11 '24 22:01 awni

Awesome! Thank you

jbochi avatar Jan 11 '24 22:01 jbochi

@awni, I can't find the GGUF-related Python code in the current examples directory. I did find it here: https://github.com/ml-explore/mlx-examples/tree/9804e1155c961c864971a5db750030ebcc3db8f2/llms/gguf_llm

But models.py still has a pdb trace in it, and it reports this error:

  File "gguf_llm/models.py", line 190, in get_string_array_field
    return [bytes(f.parts[d]).decode("utf-8") for d in f.data]
                                                       ^^^^^^
AttributeError: 'NoneType' object has no attribute 'data'

gladjoyhub avatar Jan 15 '24 06:01 gladjoyhub

What model are you using, @gladjoyhub? It looks like it doesn't have tokenizer.ggml.tokens or tokenizer.ggml.merges, so the tokenizer cannot be initialized. I have only tested this code with TinyLlama.
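
A quick way to check is to dump the file's metadata keys (a sketch, assuming an MLX build where mx.load accepts return_metadata=True; older builds may not have it):

import mlx.core as mx

# Load the file and look at its metadata: if the tokenizer fields are
# missing, there is nothing to build the tokenizer from.
_, metadata = mx.load("model.gguf", return_metadata=True)
for key in ("tokenizer.ggml.tokens", "tokenizer.ggml.merges"):
    print(key, "present" if key in metadata else "MISSING")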

jbochi avatar Jan 15 '24 06:01 jbochi

What model are you using, @gladjoyhub? It looks like it doesn't have tokenizer.ggml.tokens or tokenizer.ggml.merges, so the tokenizer cannot be initialized. I have only tested this code with TinyLlama.

@jbochi It's TheBloke/SOLAR-10.7B-Instruct-v1.0-GGUF, the Q8 file.

Also, it would be nice if lora.py could load a GGUF file as the base model too.

gladjoyhub avatar Jan 15 '24 07:01 gladjoyhub

The relevant PR is #222 from @jbochi. So far I've tested it with a TinyLlama and a Mistral model from TheBloke and it worked, but indeed I do not think it will support arbitrary GGUF files yet.

awni avatar Jan 15 '24 14:01 awni

Would love some pointers on writing output in GGUF. I am trying to convert an MLX LoRA fine-tuned model to GGUF for use in Ollama. Is there an example somewhere of writing to GGUF?

shreyaskarnik avatar Jan 17 '24 06:01 shreyaskarnik

Would love some pointers on writing output in GGUF. I am trying to convert an MLX LoRA fine-tuned model to GGUF for use in Ollama. Is there an example somewhere of writing to GGUF?

There is a save_gguf function that you can call. You just need to pass the file path and a map of arrays. However, it doesn't support quantized weights or metadata yet.
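
Roughly like this (a sketch; the tensor name is a placeholder, and everything has to be plain unquantized mx.arrays for now):

import mlx.core as mx

# save_gguf takes a file path and a flat {name: array} map; no metadata
# and no quantized tensors yet.
weights = {"token_embd.weight": mx.zeros((32000, 2048))}  # placeholder map
mx.save_gguf("finetuned.gguf", weights)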

jbochi avatar Jan 17 '24 07:01 jbochi

@jbochi this is working now for Mistral and TinyLlama with native quantization. Let's merge it after we merge https://github.com/ml-explore/mlx/pull/426

awni avatar Jan 22 '24 21:01 awni

@jbochi this is working now for Mistral and TinyLlama with native quantization. Let's merge it after we merge ml-explore/mlx#426

Amazing! Thanks for making it happen!

jbochi avatar Jan 22 '24 21:01 jbochi

Thank YOU for making it happen!

awni avatar Jan 22 '24 22:01 awni