
Ability to convert a lora_fused_model to gguf format for use in LMStudio and others

hammer-mt opened this issue 5 months ago · 11 comments

Recently I got a flow working where I train a model with mlx (this is new for me) and then move over to llama.cpp to do the conversion to gguf in order to run it on LM Studio locally. However, with Mixtral in particular the conversion fails, and I am not sure whether the problem is in my training, in llama.cpp, or something specific to the MoE architecture.

Here's how to reproduce (I'm on an M3 MacBook Pro). In the mlx-examples/lora folder:

python convert.py --hf-path mistralai/Mixtral-8x7B-Instruct-v0.1 -q

This downloads and quantizes the model locally into /mlx_model (I have 64 GB of memory, but I found it would crash if I don't use the quantized model).

Then I can run the following training script:

python lora.py \
 --train \
 --model mlx_model \
 --data ./data \
 --batch-size 2 \
 --lora-layers 8 \
 --iters 1000

In /data I have train.jsonl and valid.jsonl like in the examples, and I tested that this setup works fine on Mistral-7B-v0.1, so I know this part is fine. I'm not allowed to share the data, but you could probably just run this with the existing example data to replicate.
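For reference, each line of my train.jsonl and valid.jsonl follows the same shape as the example data, roughly like this (the prompt and response here are made up, since I can't share the real data):

{"text": "<s>[INST] An example instruction prompt goes here [/INST] The expected model response goes here</s>"}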

When I run fuse.py with the --de-quantize flag (I've tried with and without it), I get the /lora_fused_model folder:

python fuse.py --model mlx_model \
 --adapter-file ./adapters.npz \
 --de-quantize

Normally what I would do now is move that folder over to llama.cpp and use its convert.py to get the f16 version:

python convert.py models/lora_fused_model --outfile models/lora_fused_model-fp16.gguf --outtype f16

Then I would quantize further to get the Q4_K_M version that I normally use in LM Studio.

./quantize ./models/lora_fused_model-fp16.gguf ./models/lora_fused_model.Q4_K_M.gguf Q4_K_M

However, when I did that I got the error: Exception: Unexpected tensor name: model.layers.0.block_sparse_moe.experts.0.w1.weight.

I upgraded llama.cpp and then ran it again and got this error: "GGUFWriter" object has no attribute "add_expert_count"

I don't know if this is an issue with llama.cpp, or if I did something wrong in the training or conversion process that leaves llama.cpp unable to convert the model. Perhaps there's a way to convert to gguf in mlx that I missed? I have seen fine-tunes of Mixtral-8x7B out there, so I presume it's possible, but maybe the MoE format isn't supported yet or needs a different script?

hammer-mt avatar Mar 06 '24 15:03 hammer-mt

Just to be clear, the error happens when you run convert.py to get the GGUF? Where is that script from?

It does look like a llama.cpp issue to me. Are you able to make a gguf from even the original base Mixtral model https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1 ? In other words, if you remove MLX from the picture entirely, I suspect it would still not work. (MLX doesn't change layer names or attributes on the model during the conversion process.)
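For example, something along these lines would take MLX out of the loop entirely (just downloading the HF checkpoint and pointing llama.cpp's convert.py at it; paths here are illustrative):

huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1 --local-dir models/mixtral-base
python convert.py models/mixtral-base --outfile models/mixtral-base-fp16.gguf --outtype f16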

If that's the case you might want to file an issue with llama.cpp since it's better that they fix their conversion process. But if not let me know and I can repro further!

awni avatar Mar 06 '24 15:03 awni

Ah yes it's specifically when I run python convert.py models/lora_fused_model --outfile models/lora_fused_model-fp16.gguf --outtype f16.

It's this script in llama.cpp that everybody online recommends for converting to gguf (I'm not aware of other ways to do this): https://github.com/ggerganov/llama.cpp/blob/master/convert.py

When I train the normal 7B model it works fine; it's specifically the MoE model that fails. If mlx doesn't change the layer names or anything, then I suspect it's a llama.cpp issue and I will raise it over there.

However, it would be great if you guys had your own 'convert to gguf' script, because that must be a common use case given it's what text-generation-webui, Ollama, LM Studio, etc. support.

hammer-mt avatar Mar 06 '24 15:03 hammer-mt

Nice idea. We can look into something like that. It may actually not be too bad since we can already export to GGUF via MLX. It's mostly a matter of getting all the key names / metadata correct and converting the tokenizer.
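Roughly, the core of it would be something like this (a rough sketch, not a working converter: it assumes the fused weights live in a single safetensors file and skips the tensor-name remapping and tokenizer metadata that llama.cpp actually requires):

import mlx.core as mx

# Load the fused weights (assuming a single safetensors file for simplicity;
# the fused model may be sharded in practice).
weights = mx.load("lora_fused_model/model.safetensors")

# A real converter would remap tensor names to llama.cpp's conventions and add
# the required metadata (architecture, expert count for MoE, tokenizer, etc.).
metadata = {"general.name": "lora_fused_model"}

mx.save_gguf("lora_fused_model.gguf", weights, metadata)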

Maybe we should change this issue to a feature request for something like that (assuming the issue is unrelated to MLX).

awni avatar Mar 06 '24 15:03 awni

Yes happy to do that, what do I need to edit?

hammer-mt avatar Mar 06 '24 15:03 hammer-mt

Just the title I suppose. I will add the appropriate label.

awni avatar Mar 06 '24 15:03 awni

Very interesting. I have fine-tuned Mixtral and converted it back to gguf before, and it was working fine for me.

mzbac avatar Mar 07 '24 04:03 mzbac

@awni, could you assign this to me? I am keen to take a look and see what the issue is and work on the script.

mzbac avatar Mar 07 '24 04:03 mzbac

Thanks, that'd be great!

awni avatar Mar 07 '24 04:03 awni

@hammer-mt, I am unable to replicate the error on my local machine. The fused model seems to be working fine and can be converted/quantized by llama.cpp with the latest master build. However, I am using llama.cpp on my other Ubuntu server, so there may be something related to the llama.cpp setup on Mac.

By the way, I noticed that you seem to be using lora.py from the LoRA example, which is not as feature-rich as mlx_lm.lora. Maybe you can try mlx_lm.lora to see if that fixes the issue: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md
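The flags are basically the same as in the lora example, so your run would look something like this (same data folder and hyperparameters as before):

python -m mlx_lm.lora \
 --model mlx_model \
 --train \
 --data ./data \
 --batch-size 2 \
 --lora-layers 8 \
 --iters 1000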

mzbac avatar Mar 07 '24 10:03 mzbac

@hammer-mt Did you try converting with convert-hf-to-gguf.py?

I had problems converting with convert.py, then tried convert-hf-to-gguf.py and it worked.
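If it helps, the invocation should be roughly the same as the one you used before, just with the other script (paths here are illustrative):

python convert-hf-to-gguf.py models/lora_fused_model --outfile models/lora_fused_model-fp16.gguf --outtype f16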

@mzbac Not sure if this info is helpful; looking forward to you completing this feature.

madroidmaq avatar Mar 07 '24 16:03 madroidmaq

Thanks for looking into it @madroidmaq, I'll give convert-hf-to-gguf.py a try and report back.

Potentially something only occurring on Mac, @mzbac.

I was religiously following Andy Peatling's guide because I have no idea what I'm doing (this is only my second fine-tune; the first was using OpenAI's fine-tuning feature, which is just an API call): https://apeatling.com/articles/simple-guide-to-local-llm-fine-tuning-on-a-mac-with-mlx/

I saw this guide recently which uses mlx_lm.lora, and will dig into that further too: https://github.com/mzbac/mlx-lora

hammer-mt avatar Mar 07 '24 20:03 hammer-mt