
Attempting to merge with alpaca-lora and its quantization

Open taiyou2000 opened this issue 1 year ago • 18 comments

I was attempting to merge alpaca-lora from https://huggingface.co/tloen/alpaca-lora-7b with the original llama-7B from https://huggingface.co/decapoda-research/llama-7b-hf; I also tried to quantize the merged model and run the main binary in llama.cpp. The merge code is from https://github.com/clcarwin/alpaca-weight

It almost worked, right up until the final phase of running the main binary in llama.cpp. I had no problems with the merge and quantization themselves.

Then it raised an error like this:

llama_model_load: llama_model_load: unknown tensor 'model.embed_tokens.weight' in model file
main: failed to load model from './models/7B/ggml-model-q4_0.bin'

My logs are shared in my repository, along with the Colab code I used to merge and quantize the model: https://github.com/taiyou2000/personal_experimant

I'm not a machine learning expert and I haven't checked the entire llama.cpp codebase, but my theory is that the quantized model contains weights, some of which have names that main.cpp doesn't expect to see. As you can see in quantization_log.txt and pth_to_ggml_log.txt in my repository, it has names like "model.layers.0.self_attn.q_proj.weight", whereas main.cpp probably expects something like "model.layers.0.attention.wq.weight". I can run llama.cpp without any problems on my local computer with a model quantized from the torrent version, so I guess the Hugging Face version differs from it somehow.

taiyou2000 avatar Mar 15 '23 18:03 taiyou2000

I think this is because the model is in HF format, I ran into the same issue after fine-tuning LLaMA 7B on the Alpaca dataset.

nebulatgs avatar Mar 15 '23 18:03 nebulatgs

If anyone would like to collaborate on making the HF model work with this repo, please email me, or respond to this comment!

nebulatgs avatar Mar 15 '23 18:03 nebulatgs

I think the issue is that the tokens are embedded in the model file, whereas the file produced by your code does not have the tokens embedded. @ggerganov, could you confirm? This is still a case for integrating sentencepiece.
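For reference, this is roughly the step the original conversion path performs with sentencepiece to bake the vocab into the ggml file, which is what an HF-only export skips. This is a loose sketch, not the actual convert script, and the on-disk layout shown here is purely illustrative:

```python
import struct
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("models/tokenizer.model")  # path is a placeholder, adjust to your setup

with open("vocab_dump.bin", "wb") as fout:            # illustrative output file
    fout.write(struct.pack("i", sp.vocab_size()))     # number of tokens
    for i in range(sp.vocab_size()):
        piece = sp.id_to_piece(i).encode("utf-8")
        fout.write(struct.pack("i", len(piece)))      # token byte length
        fout.write(piece)                             # token bytes
        fout.write(struct.pack("f", sp.get_score(i))) # token score
```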

beiller avatar Mar 15 '23 19:03 beiller

I was comparing the parameters of the two models. It looks like just renaming the parameters might work, but I don't know which parameters correspond to one another. Also, the transformer layers of the HF model seem somewhat "inverse" relative to the torrent model, because the torrent model has a layer named output while the HF one has a layer named input in each layer. Will it be as easy as renaming parameters, or do I need to code it from scratch? (A rough renaming sketch follows the two lists below.)

HF model parameters:
layers.0.self_attn.q_proj.weight
layers.0.self_attn.k_proj.weight
layers.0.self_attn.v_proj.weight
layers.0.self_attn.o_proj.weight
layers.0.self_attn.rotary_emb.inv_freq
layers.0.mlp.gate_proj.weight
layers.0.mlp.down_proj.weight
layers.0.mlp.up_proj.weight
layers.0.input_layernorm.weight
layers.0.post_attention_layernorm.weight
norm.weight
lm_head.weight

torrent model parameters:
norm.weight
output.weight
layers.0.attention.wq.weight
layers.0.attention.wk.weight
layers.0.attention.wv.weight
layers.0.attention.wo.weight
layers.0.feed_forward.w1.weight
layers.0.feed_forward.w2.weight
layers.0.feed_forward.w3.weight
layers.0.attention_norm.weight
layers.0.ffn_norm.weight
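Here is the renaming I would try, based only on the two lists above. It's hypothetical and not verified against llama.cpp's loader:

```python
import re

# Hypothetical renaming from HF-style names to the original (torrent) names,
# derived only from the two lists above -- not verified against the loader.
TOP_LEVEL = {
    "embed_tokens.weight": "tok_embeddings.weight",
    "norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}
PER_LAYER = {
    "self_attn.q_proj.weight": "attention.wq.weight",
    "self_attn.k_proj.weight": "attention.wk.weight",
    "self_attn.v_proj.weight": "attention.wv.weight",
    "self_attn.o_proj.weight": "attention.wo.weight",
    "mlp.gate_proj.weight": "feed_forward.w1.weight",
    "mlp.down_proj.weight": "feed_forward.w2.weight",
    "mlp.up_proj.weight": "feed_forward.w3.weight",
    "input_layernorm.weight": "attention_norm.weight",
    "post_attention_layernorm.weight": "ffn_norm.weight",
}

def hf_to_llama_name(name: str):
    name = name.removeprefix("model.")      # some exports prepend "model."
    if name.endswith("rotary_emb.inv_freq"):
        return None                         # no counterpart, just drop it
    if name in TOP_LEVEL:
        return TOP_LEVEL[name]
    m = re.match(r"layers\.(\d+)\.(.+)", name)
    if m and m.group(2) in PER_LAYER:
        return f"layers.{m.group(1)}.{PER_LAYER[m.group(2)]}"
    return name                             # anything else: leave unchanged
```

Renaming alone probably isn't enough, though: from what I can tell the HF export also permutes the q/k projection weights, which is what the unpermute step discussed further down deals with.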

taiyou2000 avatar Mar 15 '23 21:03 taiyou2000

Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)
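For anyone wondering what the merge step actually does: a LoRA adapter stores a low-rank update for each target weight, and merging simply folds that update back into the base matrix. A minimal sketch of the idea (not the actual script; names and shapes are illustrative):

```python
import torch

def merge_lora_weight(base_w: torch.Tensor,   # (out_features, in_features)
                      lora_A: torch.Tensor,   # (r, in_features)
                      lora_B: torch.Tensor,   # (out_features, r)
                      alpha: float, r: int) -> torch.Tensor:
    # Standard LoRA merge: W' = W + (alpha / r) * B @ A
    return base_w + (alpha / r) * (lora_B @ lora_A)
```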

tloen avatar Mar 16 '23 00:03 tloen

Thank you so much @tloen. I was trying to convert the state_dict too, but I was struggling to figure out the unpermute function. I'm going to give it a try soon!
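For reference, my current understanding is that the HF conversion permutes the rows of wq/wk for its rotary-embedding layout, and the inverse looks roughly like this (adapted from my reading of the conversion scripts, so double-check it against them):

```python
import torch

# Undo the row permutation applied to wq/wk in the HF export.
# dim and n_heads are the 7B values (4096 / 32) -- illustrative only.
def unpermute(w: torch.Tensor, n_heads: int = 32, dim: int = 4096) -> torch.Tensor:
    return (w.view(n_heads, 2, dim // n_heads // 2, dim)
             .transpose(1, 2)
             .reshape(dim, dim))
```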

taiyou2000 avatar Mar 16 '23 01:03 taiyou2000

Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)

Instruction: What should I have for dinner?
Output: 153 grams of beef, plus rice and peas. And you need to eat all the food! [end of text]

Wow, that script is a much more straightforward approach than the rabbit hole I was going down. Nice work.

eous avatar Mar 16 '23 02:03 eous

I just tried the alpaca-lora merged model with quantization. The results were not as good as the examples shown in tloen's repo. It might be the price of quantization, or the merge may actually have been unsuccessful. Maybe I should modify the config in llama.cpp? Anyway, thank you everyone.

taiyou2000 avatar Mar 16 '23 03:03 taiyou2000

I just tried the alpaca-lora merged model with quantization. The results were not as good as the examples shown in tloen's repo. It might be the price of quantization, or the merge may actually have been unsuccessful. Maybe I should modify the config in llama.cpp? Anyway, thank you everyone.

Yeah, quantization wasn't great, but running it with mixed fp16/fp32 gave the expected performance.

eous avatar Mar 16 '23 04:03 eous

Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)

Ha! You've got almost the same code for turning HF back into llama format as I do :)

And I just had a thought: with the 4-bit GPTQ-quantized 7B/13B/... models in HF format, one could unpack them to float16, turn that into a llama model, and then requantize it with llama.cpp-quantize, which would hopefully preserve the quantization.

thement avatar Mar 16 '23 20:03 thement

So we now have https://github.com/antimatter15/alpaca.cpp, but it only has an endless chat mode. Someone needs to merge everything together with this repo so it can be run with dalai.

totoCZ avatar Mar 16 '23 22:03 totoCZ

Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)

Anyway, here's a script that also dequantizes 4-bit models so they can be requantized later (but it would work only with q4_1, and with the fix that the min/max is calculated over the whole row, not just the QK=32 block)

https://gist.github.com/thement/90be08be928e0489aba7ecae9f0b352a

If you think this is useful I can maybe upgrade the convert_pth script.
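For anyone who doesn't want to open the gist, the core idea is just the inverse of the q4_1 packing: each block stores a scale d and a minimum m, and each stored 4-bit value q expands to x = d * q + m. A rough sketch of one block (the layout is my assumption of ggml's q4_1 at the time, QK=32; check against ggml.c, and this is not the gist itself):

```python
import struct
import numpy as np

QK = 32  # values per quantization block in ggml

# Dequantize one q4_1 block back to float32.
# Assumed layout: float32 d (scale), float32 m (min), then QK/2 bytes of
# packed 4-bit values, low nibble first.
def dequantize_q4_1_block(block: bytes) -> np.ndarray:
    d, m = struct.unpack("<ff", block[:8])
    packed = np.frombuffer(block[8:8 + QK // 2], dtype=np.uint8)
    q = np.empty(QK, dtype=np.float32)
    q[0::2] = (packed & 0x0F).astype(np.float32)  # even positions: low nibbles
    q[1::2] = (packed >> 4).astype(np.float32)    # odd positions: high nibbles
    return d * q + m
```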

thement avatar Mar 16 '23 22:03 thement

Anyway, here's a script that also dequantizes 4-bit models so they can be requantized later (but it would work only with q4_1, and with the fix that the min/max is calculated over the whole row, not just the QK=32 block)

@thement wait, what? It can losslessly roundtrip?

namliz avatar Mar 18 '23 02:03 namliz

Could we load a LoRA with llama.cpp? Some languages are not well supported in the original llama, but they could be provided via a LoRA.

linonetwo avatar Mar 18 '23 12:03 linonetwo

Could we load a LoRA with llama.cpp? Some languages are not well supported in the original llama, but they could be provided via a LoRA.

Yes, you just need to use the script by @tloen to obtain the pytorch weights, then convert to ggml using the steps described in the README.

Note that the output with llama.cpp is not the same as with pytorch, even without quantization. Here's a simple test I did locally with Alpaca lora:

[image: alpaca-lora quicksort output (pytorch)]

Here's the same request to llama.cpp using the converted weights (no quantization). I also modified alpaca.sh to pass similar arguments: ./main -m ./models/7BLora/ggml-model-f16.bin --color -f ./prompts/alpaca.txt -ins --top_k 40 --top_p 0.75 --temp 0.1 --repeat_penalty 1 -t 4 -n 2000 (not sure if I missed something; I don't really know what most of these parameters mean, so there's a rough sketch of what they do at the end of this comment). Here's the result:

image

It started well but ended up messing things up later. The funny thing is that the llama.cpp implementation would have been faster (if it were correct), since it only loops over the array once.

It really got messed up when I tried with the 4-bit quantized weights:

image

The good news is that the non-quantized version is faster and uses less memory than the pytorch version on the CPU. Though maybe the pytorch slowdown was because it loaded the fine-tuning at runtime? @tloen might know. It would be a huge win if we could get llama.cpp to produce the same output as pytorch. @ggerganov might know more about this difference in outputs.
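For what it's worth, here's my rough mental model of what those flags do to the next-token distribution, sketched as code. It's simplified and illustrative only; llama.cpp's actual sampler also applies repeat_penalty over recently generated tokens and differs in the details:

```python
import numpy as np

# Simplified picture of --temp, --top_k and --top_p.
def sample_next_token(logits: np.ndarray, temp=0.1, top_k=40, top_p=0.75,
                      rng=np.random.default_rng()) -> int:
    scaled = logits / max(temp, 1e-6)            # temperature < 1 sharpens
    top_idx = np.argsort(scaled)[::-1][:top_k]   # keep only the k most likely
    probs = np.exp(scaled[top_idx] - scaled[top_idx].max())
    probs /= probs.sum()
    keep = (np.cumsum(probs) - probs) < top_p    # nucleus (top-p) cutoff
    probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(top_idx[keep], p=probs))
```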

tarruda avatar Mar 21 '23 13:03 tarruda

Here's a comparison with GPT 3.5: image

I tried raising "temp" to 0.7 to match GPT 3.5, but it resulted in a worse solution (even though the goal of sorting "in place" was good :smile:):

image

tarruda avatar Mar 21 '23 14:03 tarruda

When you see a disparity between the outputs of two engines that should be identical, and you know how to use a debugger, it's quite helpful to debug both in parallel and see where the numbers start diverging.

xloem avatar Mar 26 '23 23:03 xloem

I am successfully using text-generation-webui to train LoRAs for/from llama-7b (8-bit). Is there any way to merge the trained LoRA with the llama-7b weights? My goal is to train a LoRA, merge, train again on something else, merge again, and so on. Does anyone here have an idea how to achieve the merging part?
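A possible approach, assuming the LoRA is saved in the standard peft format: fold it into the base weights and save a plain checkpoint you can train on again. This is an untested sketch; the repo name and paths are placeholders, the base model has to be loaded in fp16 rather than 8-bit for the merge, and merge_and_unload needs a reasonably recent peft:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the fp16 base model, attach the trained LoRA, fold it in, and save.
base = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "path/to/your-lora")  # placeholder path
model = model.merge_and_unload()        # folds the LoRA deltas into the weights
model.save_pretrained("llama-7b-merged")
AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf").save_pretrained("llama-7b-merged")
```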

Gitterman69 avatar Apr 10 '23 12:04 Gitterman69

Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)

The code doesn't work for llama-65B. It seems the params in the code differ between 7B and 65B. How can I get the correct params?
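For reference, each original checkpoint ships a params.json with its own values; my understanding of them, worth verifying against the actual files, is roughly:

```python
# Hyperparameters per model size, as I understand them from each checkpoint's
# params.json -- verify against the actual files before relying on these.
LLAMA_PARAMS = {
    "7B":  dict(dim=4096, n_layers=32, n_heads=32),
    "13B": dict(dim=5120, n_layers=40, n_heads=40),
    "30B": dict(dim=6656, n_layers=60, n_heads=52),
    "65B": dict(dim=8192, n_layers=80, n_heads=64),
}
```

Note also that the larger checkpoints are sharded across multiple consolidated.*.pth files (65B across 8, if I remember correctly), so a script written around the single-file 7B checkpoint would need to handle that as well.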

ghqing0310 avatar May 16 '23 02:05 ghqing0310