llama.cpp
Attempting to merge with alpaca-lora and quantize the merged model
I was attempting to merge alpaca-lora from https://huggingface.co/tloen/alpaca-lora-7b with the original llama-7B from https://huggingface.co/decapoda-research/llama-7b-hf; I also tried to quantize the merged model and run the main binary in llama.cpp. The merge code is from https://github.com/clcarwin/alpaca-weight
It was almost successful until the final phase, running the main binary in llama.cpp. I had no problems with the merge and the quantization.
Then it raised an error like this:
llama_model_load: llama_model_load: unknown tensor 'model.embed_tokens.weight' in model file
main: failed to load model from './models/7B/ggml-model-q4_0.bin'
I will share my logs in my repository. The code I used in colab to merge and quantize the model is there too: https://github.com/taiyou2000/personal_experimant
I'm not a machine learning expert and I haven't checked the entire llama.cpp code, but my theory is that the quantized model contains weights whose names main.cpp doesn't expect to see. As you can see in quantization_log.txt and pth_to_ggml_log.txt in my repository, it has names like "model.layers.0.self_attn.q_proj.weight", and main.cpp probably expects something like "model.layers.0.attention.wq.weight". I can run llama.cpp without any problems on my local computer with a model quantized from the torrent version, so I guess the huggingface version differs from it somehow.
I think this is because the model is in HF format; I ran into the same issue after fine-tuning LLaMA 7B on the Alpaca dataset.
If anyone would like to collaborate on making the HF model work with this repo, please email me, or respond to this comment!
I think the issue is that the tokens are embedded in the model file, whereas your code does not have the tokens embedded. @ggerganov, could you confirm? There is still a case for integrating sentencepiece.
I was comparing the parameters of the two models. I noticed that maybe just renaming the parameters would work, but I don't know which parameters correspond to one another. I also think the transformer layers of the HF model are somewhat "inverted" relative to the torrent model, because the torrent model has a layer named output while the HF one has a layer named input in each block. Will it be as easy as renaming parameters, or do I need to code it from scratch? (A rough sketch of the renaming I have in mind follows the two parameter lists below.)
HF model parameters:
layers.0.self_attn.q_proj.weight
layers.0.self_attn.k_proj.weight
layers.0.self_attn.v_proj.weight
layers.0.self_attn.o_proj.weight
layers.0.self_attn.rotary_emb.inv_freq
layers.0.mlp.gate_proj.weight
layers.0.mlp.down_proj.weight
layers.0.mlp.up_proj.weight
layers.0.input_layernorm.weight
layers.0.post_attention_layernorm.weight
norm.weight
lm_head.weight
torrent model parameters:
norm.weight
output.weight
layers.0.attention.wq.weight
layers.0.attention.wk.weight
layers.0.attention.wv.weight
layers.0.attention.wo.weight
layers.0.feed_forward.w1.weight
layers.0.feed_forward.w2.weight
layers.0.feed_forward.w3.weight
layers.0.attention_norm.weight
layers.0.ffn_norm.weight
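Here is a rough sketch of the renaming I have in mind. The correspondences are my guess from eyeballing the two lists (plus the embedding name from the error above), so treat it as unverified:

```python
# Sketch only: a guess at how the HF parameter names map onto the original llama
# names that convert-pth-to-ggml.py / main.cpp expect. The q_proj/k_proj weights
# also need an unpermute step (see below), rotary_emb.inv_freq has no counterpart
# and is dropped, and "tok_embeddings.weight" is my assumption for where
# model.embed_tokens.weight should go.
import re

TOP_LEVEL = {
    "embed_tokens.weight": "tok_embeddings.weight",
    "norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}

PER_LAYER = {
    "self_attn.q_proj.weight": "attention.wq.weight",
    "self_attn.k_proj.weight": "attention.wk.weight",
    "self_attn.v_proj.weight": "attention.wv.weight",
    "self_attn.o_proj.weight": "attention.wo.weight",
    "mlp.gate_proj.weight": "feed_forward.w1.weight",
    "mlp.down_proj.weight": "feed_forward.w2.weight",
    "mlp.up_proj.weight": "feed_forward.w3.weight",
    "input_layernorm.weight": "attention_norm.weight",
    "post_attention_layernorm.weight": "ffn_norm.weight",
}

def rename(hf_name: str):
    """Return the llama-style name for an HF parameter, or None if it should be dropped."""
    name = hf_name[len("model."):] if hf_name.startswith("model.") else hf_name
    if name in TOP_LEVEL:
        return TOP_LEVEL[name]
    m = re.match(r"layers\.(\d+)\.(.+)", name)
    if m and m.group(2) in PER_LAYER:
        return f"layers.{m.group(1)}.{PER_LAYER[m.group(2)]}"
    return None  # e.g. rotary_emb.inv_freq
```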
Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)
Thank you so much @tloen. I was trying to convert the state_dict too but was struggling to figure out the unpermute function. I'm gonna give it a try soon!
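In case it helps anyone else, this is my current guess at the unpermute, written for the 7B shapes. It assumes the HF conversion permuted the q/k projection weights per attention head and that this is the exact inverse of that permutation:

```python
# Sketch: unpermute for the 7B shapes (n_heads=32, dim=4096); other model sizes
# need different n_heads/dim values.
import torch

n_heads, dim = 32, 4096

def unpermute(w: torch.Tensor) -> torch.Tensor:
    return w.view(n_heads, 2, dim // n_heads // 2, dim).transpose(1, 2).reshape(dim, dim)

# applied only to the attention q/k weights when building the llama-style state_dict, e.g.
# new_sd["layers.0.attention.wq.weight"] = unpermute(hf_sd["model.layers.0.self_attn.q_proj.weight"])
```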
Instruction: What should I have for dinner?
Output: 153 grams of beef, plus rice and peas. And you need to eat all the food! [end of text]
Wow, that script is a much more straightforward approach than the rabbit hole I was going down. Nice work.
I just tried the alpaca-lora merged model with quantization. The result was not as good as the examples shown in the tloen repo. It might be the price of quantization, or the merge was actually unsuccessful. Maybe I should modify the config in llama.cpp? Anyway, thank you everyone.
Yeh, quantization wasn't great but running it with mixed fp16/fp32 gave expected performance.
Ha! You've got almost the same code for turning HF back to llama format as me :)
And I just had a thought: with a 4-bit GPTQ-quantized 7B/13B/... model in HF format, one could unpack it to float16, turn it into a llama model, and then requantize it with llama.cpp's quantize, which would hopefully preserve the quantization.
So we now have https://github.com/antimatter15/alpaca.cpp, but it only has an endless chat mode. Someone needs to merge everything together with this repo so it can be run with dalai.
Anyway, here's a script that also does unquantization of 4-bit models so they can be requantized later (but it would only work with q4_1, and with a fix so that the min/max is calculated over the whole row, not just the QK=32 block):
https://gist.github.com/thement/90be08be928e0489aba7ecae9f0b352a
If you think this is useful I can maybe upgrade the convert_pth script.
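For anyone trying to follow the gist, here is a minimal sketch of the q4_1 round trip as I understand it: one scale d and one minimum m per block of QK=32 values, with each value reconstructed as q * d + m. The "whole row" remark above is about where that min/max is taken:

```python
# Toy q4_1-style quantize/dequantize round trip (block-wise min/max, 4-bit codes).
import numpy as np

QK = 32

def quantize_q4_1(x: np.ndarray):
    """x: 1-D float array whose length is a multiple of QK."""
    blocks = x.reshape(-1, QK)
    mins = blocks.min(axis=1, keepdims=True)
    maxs = blocks.max(axis=1, keepdims=True)
    d = (maxs - mins) / 15.0          # scale per block (4 bits -> 16 levels)
    d[d == 0] = 1.0                   # avoid division by zero on constant blocks
    q = np.clip(np.round((blocks - mins) / d), 0, 15).astype(np.uint8)
    return q, d, mins

def dequantize_q4_1(q, d, mins):
    return (q * d + mins).astype(np.float32).reshape(-1)

# toy round trip
x = np.random.randn(4 * QK).astype(np.float32)
q, d, m = quantize_q4_1(x)
print("max abs error:", np.abs(x - dequantize_q4_1(q, d, m)).max())
```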
@thement wait, what? It can losslessly roundtrip?
Could we load LoRA with llama.cpp? Some languages are not well supported in original llama, but may be provided via LoRA.
Yes, you just need to use the script by @tloen to obtain the pytorch weights, then convert to ggml using the steps described in the README.
Note that the output with llama.cpp is not the same as with pytorch, even without quantization. Here's a simple test I did locally with Alpaca lora:
Same request to llama.cpp using the converted weights (no quantization). I also modified alpaca.sh to pass similar arguments: ./main -m ./models/7BLora/ggml-model-f16.bin --color -f ./prompts/alpaca.txt -ins --top_k 40 --top_p 0.75 --temp 0.1 --repeat_penalty 1 -t 4 -n 2000
(not sure if I missed something; I don't really know what most of these parameters mean). Here's the result:
It started well but ended up messing up later. Funny thing is that the llama.cpp implementation would have been faster (if it were correct), since it only loops over the array once.
It really got messed up when I tried with the 4-bit quantized weights:
The good news is that the non-quantized version is faster and uses less memory than the pytorch version on CPU. Though maybe the pytorch slowdown was because it loaded the fine-tuning at runtime? @tloen might know. It would be a huge win if we could get llama.cpp to produce the same output as pytorch. @ggerganov might know more about this difference in outputs.
Here's a comparison with GPT 3.5:
I tried raising "temp" to 0.7 to match that of GPT 3.5, but it resulted in a worse solution (even though the goal of sorting "in place" was good :smile: ):
When you see a disparity between the outputs of two engines that should be identical, and you know how to use a debugger, it's quite helpful to debug both in parallel and see where the numbers start diverging.
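On the pytorch side, something like this rough sketch (the model path and prompt are placeholders) dumps per-layer activation statistics that can then be compared against numbers printed from a debugger inside llama.cpp:

```python
# Sketch: register forward hooks on each decoder layer of the HF model and print
# mean/std of the hidden states, to localize where the two engines start diverging.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("path/to/merged-alpaca", torch_dtype=torch.float32)
tok = LlamaTokenizer.from_pretrained("path/to/merged-alpaca")

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        stats[name] = (h.float().mean().item(), h.float().std().item())
    return hook

for i, layer in enumerate(model.model.layers):
    layer.register_forward_hook(make_hook(f"layer_{i}"))

ids = tok("Write a sorting algorithm in Python.", return_tensors="pt").input_ids
with torch.no_grad():
    model(ids)

for name, (mean, std) in stats.items():
    print(f"{name}: mean={mean:.6f} std={std:.6f}")
```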
I am using text-generation-webui to successfully train LoRAs for/from llama-7b (8-bit). Is there any way to merge the trained LoRA with the llama-7b weights? My goal is to train a LoRA, merge, train again on something else, merge again, and so on. Does someone here have an idea how to achieve the merging part?
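Not sure about text-generation-webui specifically, but if the LoRA is saved in the usual peft format, the merge step itself can be done roughly like this (a sketch; the paths are placeholders and it assumes a peft version that provides merge_and_unload):

```python
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# load the fp16 base weights and attach the trained LoRA adapter
base = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "path/to/trained-lora")

model = model.merge_and_unload()          # fold the LoRA deltas into the base weights
model.save_pretrained("path/to/llama-7b-merged")
# "path/to/llama-7b-merged" can then serve as the base model for the next LoRA round
```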
The script doesn't work for llama-65B. It seems the params in the code differ between 7B and 65B. How can I get the correct params?
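One option is to read the hyperparameters from the model's own config instead of hard-coding the 7B values; a sketch (the model id is just an example, and the original torrent weights ship an equivalent params.json next to the checkpoints):

```python
# Sketch: pull dim / n_heads / n_layers from the HF config of the target model size.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("decapoda-research/llama-65b-hf")
params = {
    "dim": cfg.hidden_size,              # 8192 for 65B
    "n_heads": cfg.num_attention_heads,  # 64 for 65B
    "n_layers": cfg.num_hidden_layers,   # 80 for 65B
    "norm_eps": cfg.rms_norm_eps,
    "vocab_size": cfg.vocab_size,
}
print(params)
```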