llama.cpp
Add LoRA support
This change allows applying LoRA adapters on the fly without having to duplicate the model files.
Instructions:
- Obtain the HF PEFT LoRA files `adapter_config.json` and `adapter_model.bin` of a LoRA adapter and put them in the same path. For alpaca, these can be found at https://huggingface.co/tloen/alpaca-lora-7b/tree/main
- Convert them using `convert-lora-to-ggml.py` to obtain `ggml-adapter-model.bin`:
  `python convert-lora-to-ggml.py lora/alpaca-lora-7b`
- Use the `ggml-adapter-model.bin` with `--lora`:
  `./main -m models/7B/ggml-model-f16.bin --lora lora/alpaca-lora-7b/ggml-adapter-model.bin --color -f ./prompts/alpaca.txt -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7`
- When using a quantized model, the quality may suffer. To avoid this, specify an f16/f32 model with `--lora-base` to use as a base. The layers modified by the LoRA adapter will be applied to the lora-base model and then quantized to the same format as the model specified with `-m`. Layers not modified by the LoRA adapter will remain untouched.
  `./main -m models/7B/ggml-model-q4_0.bin --lora lora/alpaca-lora-7b/ggml-adapter-model.bin --lora-base models/7B/ggml-model-f16.bin --color -f ./prompts/alpaca.txt -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7`
Limitations:
- Using `--lora` disables mmap since the models have to be modified anyway.
- When using `--lora-base`, a `ggml_cpy` operation is used to quantize the result, which currently is done in a single thread. Parallelizing `ggml_cpy` will improve loading times.
Awesome! LoRAs would be super useful, especially with how easy they're becoming to train right now 🔥
Do you think it is possible (or desirable) to produce quantized versions of the patched tensors?
(f16 llama model, LoRA's tensors) --> f16 patched tensors --> quantized patched tensors
This would bring the speedups from quantization and allow mmapping both files. The pages from the original tensors wouldn't be faulted / loaded into memory (`MAP_POPULATE` would have to be disabled).
@Piezoid I am not sure what is the best way to handle this. Ideally, for simplicity, the resulting patched tensors would be in the same format as they were initially, so if you patch a q4_0 model you still end up with a q4_0 model. However, that may affect the quality significantly, and it may be as slow or slower than just patching the f16 model and quantizing it on the fly afterwards. We need to run more tests; I may try implementing both options to see what works best.
@slaren Like you said, adding the LoRA deltas to a q4 quantized model is most likely very bad for quality. The quantization must happen afterward. My suggestion was to generate a separate model file consisting solely of the patched tensors with the LoRA full-rank weights added, and potentially applying quantization as a final step.
The idea is to save disk space by only requiring the space for the modified tensors. By completing the patching process offline, it's possible that the load time will also decrease.
Your proposal of patching and quantizing during load time is interesting, but it necessitates loading an f16 llama model and quantizing tensors that haven't been modified. It's possible that I'm mistaken since I'm unsure which tensors are quantized and which ones are patched by LoRA.
@Piezoid it is not really viable to store the pre-patched tensors because the file size would be nearly the same as the entire model. The advantage of lora is that to patch a 4096x4096 matrix you only need a 16x4096 and a 4096x16 matrix (for rank 16; it could be any other number), i.e. 2·16·4096 = 131,072 values instead of 4096·4096 ≈ 16.8M. Patch it and suddenly your 2x16x4096 becomes 4096x4096.
Very useful info.
Another approach to think about is to use the distributive property of matrix multiplication: (B + C)A = BA + CA.
We can add optional LoRA nodes to the llama computation graph. For example:
    cur = ggml_mul_mat(ctx0, model.layers[il].wo, cur);

would become:

    curL = ggml_mul_mat(ctx0, model.layers[il].wo, cur);

    if (lora_enabled) {
        // can be precomputed once at the cost of more memory,
        // or we can keep unpacking it each time to save memory
        lora = ggml_mul_mat(ctx0, model.loraB[il], model.loraA_trans[il]);
        lora = ggml_mul_mat(ctx0, lora, cur); // F32 mul_mat
        curL = ggml_add(ctx0, curL, lora);    // F32 add
    }

    cur = curL;
The drawback is slower inference due to the extra `ggml_mul_mat`, but it would be trivial to dynamically load new LoRAs on-the-fly. And the fundamental model is unchanged and can remain quantized.
A small side-note: I realized that in some cases it will also be necessary to add a scaling factor. Specifically, this is what PEFT does to merge the lora:
    self.scaling = self.lora_alpha / self.r
    if fan_in_fan_out:
        self.weight.data = self.weight.data.T
    ...
    self.weight.data += (
        transpose(self.lora_B.weight @ self.lora_A.weight, self.fan_in_fan_out) * self.scaling
    )
    ...
    def transpose(weight, fan_in_fan_out):
        return weight.T if fan_in_fan_out else weight
Where `lora_alpha` and `r` (rank) are parameters in the `adapter_config.json`.
In the case of alpaca `lora_alpha = r`, so this is a noop, but this is not always the case; for example, in gpt4all `lora_alpha = 32` and `r = 8`.
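As a rough illustration of how that scaling could carry over to the ggml side (a minimal sketch with hypothetical variable names, assuming the converter stores `r` and `lora_alpha` in the adapter header and that `BA` already holds the f32 product of the LoRA matrices; not the actual code in this PR):

```c
// Hypothetical sketch: reproduce PEFT's scaling before B*A is added to the weights.
// lora_r and lora_alpha are assumed to come from the converted adapter's header.
const float scaling = (float) lora_alpha / (float) lora_r; // alpaca: alpha == r -> 1.0 (no-op); gpt4all: 32 / 8 = 4.0

// BA: f32 tensor holding B*A for one weight matrix, computed elsewhere.
BA = ggml_scale(lora_ctx, BA, ggml_new_f32(lora_ctx, scaling));
```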
@ggerganov In addition to the performance considerations, something to keep in mind is that the set of tensors the lora is applied to is entirely up to the implementation; for example, alpaca applies it to all of q,k,v,o but gpt4all only to q,v. I imagine that eval would quickly turn to spaghetti if we have to consider every single tensor separately.
This should work with quantized models now. Patching quantized models doesn't seem so bad, I got a perplexity of 6.6281 on q4_0 with alpaca.
Now that #801 has been merged, using `--lora` disables mmap. Loading is a bit slower but it should work on Windows now.
Awesome 🔥 I'll test it on Windows soon. This feature is super useful 🙂
So, to be clear, we will load orig params, and then in a batched fashion:
- Load fp16 LoRA for the given matrix
- Dequantize orig params to fp16
- Apply lora
- Requantize to save memory
Any rough estimate for how long this adapter "loading" time is?
> using `--lora` disables mmap

I guess since you may patch an arbitrary fraction of the weights, the original weights for the patched matrices are loaded just once anyway. But mmap might still be useful for the case of a relatively small fraction of patched weights + hot-swapping LoRAs. Just a thought.
CoW for a large fraction of the weights basically duplicates them, so it's very much unviable.
Replace fp16 with fp32 and that's pretty close to the way it works at the moment:
- multiply the lora matrices B and A in f32
- scale BA in f32
- add the scaled BA to the original weights; this is where the dequantizing/requantizing happens if necessary
The time to apply the adapter for me varies from ~5 seconds with a small lora adapter on 7B to upwards of a minute with a larger lora on 30B. The slowest part by far is multiplying the lora matrices.
There may be some ways to accelerate this, but at the moment I am more concerned with correctness and supporting all the use cases.
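For reference, a hedged sketch of what that per-tensor merge could look like with the ggml API (hypothetical function and tensor names, assuming the base weights are available in f16/f32; this illustrates the steps above, not the exact code in this PR):

```c
#include "ggml.h"

// Sketch: merge one LoRA-patched tensor as W' = W_base + scaling * (B x A),
// then copy the result into the model tensor, re-quantizing it if the model
// tensor is stored in a quantized format.
static void apply_lora_tensor(struct ggml_context * lora_ctx,
                              struct ggml_tensor  * model_w,  // destination (possibly quantized)
                              struct ggml_tensor  * base_w,   // f16/f32 base weights (--lora-base)
                              struct ggml_tensor  * loraA,    // low-rank factor A
                              struct ggml_tensor  * loraB,    // low-rank factor B
                              int lora_r, int lora_alpha) {
    const float scaling = (float) lora_alpha / (float) lora_r;

    // Restore the full-rank delta from the two low-rank factors
    // (argument order follows ggml_mul_mat's convention).
    struct ggml_tensor * BA = ggml_mul_mat(lora_ctx, loraA, loraB);
    BA = ggml_scale(lora_ctx, BA, ggml_new_f32(lora_ctx, scaling));

    // Add the delta to the base weights, then copy into the model tensor;
    // ggml_cpy converts to the destination type, so this is where the
    // re-quantization happens (currently single-threaded, as noted above).
    struct ggml_tensor * sum = ggml_add(lora_ctx, base_w, BA);
    struct ggml_tensor * out = ggml_cpy(lora_ctx, sum, model_w);

    struct ggml_cgraph gf = ggml_build_forward(out);
    gf.n_threads = 1;
    ggml_graph_compute(lora_ctx, &gf);
}
```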
I'm trying to troubleshoot some issues on Windows. First, the conversion script and overall process were straightforward, so good job making it simple.
I was able to load the 7B llama and 7B lora fine, but I noticed that I didn't seem to get the responses I expect with the Lora applied. This seemed odd, because it was behaving as if the lora wasn't present at all.
When I tried testing with the 13B model and 13B lora, I ran into issues when trying to run main. It mentioned `not enough space in the context's memory pool`. I have 64GB of system RAM, and it's not close to being maxed out, so I'm confused about what is happening.
C:\Users\Bradarr\Documents\GitHub\llama.cpp> ./build/bin/main -m D:\models\LLaMA\13B\ggml-model-q4_0-nmap.bin --lora D:\models\loras\bradarr-lora\13B\ggml-adapter-model.bin
main: seed = 1681243691
llama.cpp: loading model from D:\models\LLaMA\13B\ggml-model-q4_0-nmap.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: f16 = 2
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7945693.73 KB
llama_model_load_internal: mem required = 9807.47 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file: applying lora adapter from 'D:\models\loras\bradarr-lora\13B\ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file: r = 8, alpha = 16, scaling = 2.00
llama_apply_lora_from_file: ggml_new_tensor_impl: not enough space in the context's memory pool (needed 105185904, available 104857600)
Any pointers? Super pumped to get this working because it opens up a ton of possibilities! Also, just an idea, but it might be nice to have the option to fuse the lora to the base model. Once you have a lora that works really well and constantly use it, it would be nice to bundle it permanently.
edit (some additional info):
ggml_new_tensor_impl: context memory pool -> (needed 209715232, available 421527552)
ggml_new_tensor_impl: context memory pool -> (needed 419430640, available 421527552)
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file: applying lora adapter from 'D:\models\loras\bradarr-lora\13B\ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file: r = 8, alpha = 16, scaling = 2.00
llama_apply_lora_from_file: ggml_new_tensor_impl: context memory pool -> (needed 163872, available 104857600)
ggml_new_tensor_impl: context memory pool -> (needed 327920, available 104857600)
ggml_new_tensor_impl: context memory pool -> (needed 105185728, available 104857600)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 105185904, available 104857600)
@MillionthOdin16 thanks for testing this, it has been a struggle telling for sure if the lora that I had tried had any meaningful effects, but I think I found a problem. Can you see if the latest changes fixes your issues?
Awesome! Memory allocation issues are fixed and now things are running smoothly.
I'm not getting the responses I'd expect lora-wise, so I suspect there is something off about how the lora is applied. Now that I can run my 13B model, it's much easier to see when the lora is working correctly (13B is my best trained lora). Is there anything I can do to help troubleshoot?
I have a lora that's 25MB that significantly improves the output when applied to the plain llama model. I don't know if a lora that is fully merged into the base model might help as well (I don't know if we can compare the effective weights between this implementation and the lora-fused model?)
Once this works as expected it will be huge. Moving around 25MB loras vs base models is so much easier. And there's lots to be evaluated with layering loras and scaling them based off ratios :D
Are you using a f16 model? Trying to apply a lora to a quantized model may be a terrible idea after all.
You're right. The output works as expected when the llama model is f32. Nice job!
Now I'm trying to figure out the best way to make it usable. After the model is merged completely with the lora and quantized to 4 bits, it still produces good output (my point being that eventually we will want to get these fully quantized).
So we're merging at f32 to keep precision? I'm wondering what the best approach is for allowing this to work on quantized models. The ability to have a lora run on top of the base model in llama.cpp is in itself huge, because moving significant variations of llama around becomes trivial. Having a way for a user to set a lora and have it fused to the model, which could then be quantized down to 4 bits, would be really helpful. It's not as streamlined as realtime loading of loras, but it makes the use of loras significantly easier.
Do you have any thoughts on how the quantization could be handled in memory? Has anyone tested whether a quantized lora still has a useful impact on a quantized base model?
Extra Info
This works:
PS C:\Users\Bradarr\Documents\GitHub\llama.cpp> ./build/bin/main -m "D:\models\LLaMA\13B\ggml-model-f32.bin" --lora "D:\models\loras\bradarr-lora\13B\ShareGPTUnchained\ggml-adapter-model.bin" --interactive-first
main: seed = 1681250916
llama.cpp: loading model from D:\models\LLaMA\13B\ggml-model-f32.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 0 (all F32)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 50843293.73 KB
llama_model_load_internal: mem required = 51699.65 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file: applying lora adapter from 'D:\models\loras\bradarr-lora\13B\ShareGPTUnchained\ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file: r = 8, alpha = 16, scaling = 2.00
llama_apply_lora_from_file: .......... done (18393.01 ms)
system_info: n_threads = 4 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0
This doesn't:
PS C:\Users\Bradarr\Documents\GitHub\llama.cpp> ./build/bin/main -m "D:\models\LLaMA\13B\ggml-model-q4_0-nmap.bin" --lora "D:\models\loras\bradarr-lora\13B\ShareGPTUnchained\ggml-adapter-model.bin" --interactive-first
main: seed = 1681251252
llama.cpp: loading model from D:\models\LLaMA\13B\ggml-model-q4_0-nmap.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7945693.73 KB
llama_model_load_internal: mem required = 9807.47 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file: applying lora adapter from 'D:\models\loras\bradarr-lora\13B\ShareGPTUnchained\ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file: r = 8, alpha = 16, scaling = 2.00
llama_apply_lora_from_file: .......... done (10663.88 ms)
system_info: n_threads = 4 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0
Good to hear that it is working!
Regarding creating pre-merged models, it is already possible to do that in python by using a script similar to this one from alpaca-lora that merges the lora and then exports the model as pth, which can then be converted to ggml as usual with `convert-pth-to-ggml.py`. I am not sure that it is worth replicating the same feature in llama.cpp, but I am not entirely opposed to it if it can bring some convenience.
I suspect that loading the layers modified by the lora in f16 and then quantizing them back into the same format as the model may be fast enough to be practical. So you could do something like `main -m models/7B/q4_0.bin --lora-base models/7B/f16.bin --lora mylora.bin`, and it would keep the unmodified layers from the q4_0 model, but any layers modified by the lora would be loaded from the f16, patched and then quantized to q4_0 or whatever is the format of the model specified in `-m`.
> I suspect that loading the layers modified by the lora in f16 and then quantizing them back into the same format as the model may be fast enough to be practical. So you could do something like `main -m models/7B/q4_0.bin --lora-base models/7B/f16.bin --lora mylora.bin`, and it would keep the unmodified layers from the q4_0 model, but any layers modified by the lora would be loaded from the f16, patched and then quantized to q4_0 or whatever is the format of the model specified in `-m`.
Okay, I see. Just to note, I tested f32, f16, and q4_0 base llama models with the same lora file. f32 was definitely lora-ized, f16 was definitely lora-ized (although I don't know how the output quality differs from f32), and q4_0 didn't seem to have any variation resulting from the lora. I haven't checked the code to know if this is expected.
Do you think applying a quantized lora to a quantized model might have any merit? Sometimes we get interesting results, and it would definitely be faster (assuming you want to trade accuracy for speed).
> Regarding creating pre-merged models, it is already possible to do that in python by using a script similar to this one from alpaca-lora that merges the lora and then exports the model as pth, which can then be converted to ggml as usual with `convert-pth-to-ggml.py`. I am not sure that it is worth replicating the same feature in llama.cpp, but I am not entirely opposed to it if it can bring some convenience.
Yes, I've seen the scripts, but I think for most users the understanding of model file formats and what they currently have vs what format they need is very confusing. My thought is that loras have the ability to significantly change the model outputs, are super lightweight, and are becoming more accessible and easier to train with projects like @lxe/simple-llm-finetuner. If we are able to streamline the use of loras and conversion of a lora adapter to a ggml model format they are familiar with, we can make learning about language models much easier (abstracting away as much pytorch/GPU/heavy ML stuff as possible). I know you already know this haha, I'm just excited about how this and similar projects make very technical areas easy to play with.
Also, I've noticed a scaling factor in console and you've mentioned it some. Is this something that directly affects the 'impact' of the lora weights on the overall model? If so, it could be useful to break it out as an argument to make experimentation easier. With stable diffusion they've done some pretty cool things with mixing different lora layers (so I'm thinking about this for down the line)
> f32 was definitely lora-ized, f16 was definitely lora-ized (although I don't know how the output quality differs from f32), and q4_0 didn't seem to have any variation resulting from the lora. Haven't checked the code to know if this is expected.
From what I understand, the llama models are natively f16, so I wouldn't expect much benefit from using a f32 model.
> Do you think applying a quantized lora to a quantized model might have any merit? Sometimes we get interesting results, and it would definitely be faster (assuming you want to trade accuracy for speed).
The problem with doing that is that the loras make very small modifications to the weights, and the changes may be completely lost in the noise when applied to a quantized model. Using a quantized lora too just makes the problem worse; I don't think that would work at all.
> Also, I've noticed a scaling factor in console and you've mentioned it some. Is this something that directly affects the 'impact' of the lora weights on the overall model?
This is just something that the PEFT library does based on the `lora_alpha` parameter and the rank of the lora, and I don't think it should be modified at all, but who knows what effect it might have. Applying loras on top of other loras seems very suspect to me; I wouldn't expect it to work at all, but I guess in some cases it might? Anyway, I would leave that experimentation to the GPU people; if they find something worthwhile we can back-port it here.
> ~5 seconds with a small lora adapter on 7B to upwards of a minute with a larger lora on 30B. The slowest part by far is multiplying the lora matrices.
Is this already parallelised?
Can we also add the option to re-export the resulting lora-applied quantized model to a bin file? The user may want to test out the lora loaded dynamically, but then save it afterwards so it's faster to load.
I suppose the best API would be an interactive-mode command which exports the currently loaded model to a bin file, LoRA or not. But there should also be a direct export workflow.
Arguments for a direct export workflow:
- I believe that applying the lora weights using this method is more space-efficient, since the output is requantized to int4 immediately.
- For instance, I need 60GB of DRAM to apply a (non-LoRA) diff to obtain the vicuna weights, due to loading and applying the entire unquantized model. I don't know if https://github.com/tloen/alpaca-lora#inference-generatepy requires significant DRAM too.
How do you train a lora? Much appreciated!
I've optimized the LoRA matmul to be 4X faster with AVX2 and 3X faster with AVX. https://github.com/ggerganov/llama.cpp/issues/956
I will contribute the optimizations once this PR has been merged.
My LoRA application time is now 5.4s. Previously it was 20s. Feels like a breeze.
@slaren what's your estimate for the amount of work needed to get this working with quantized models? Is this something we can break into smaller tasks and help with?
Also, will #951 have an impact on this?
> what's your estimate for the amount of work needed to get this working with quantized models?
This does work with quantized models.
It's not a lot of work, just annoying work, because the model loader isn't really designed for this use case and it will take some time to figure out the best way to adapt it. #951 will have some effect, but as long as the SIMD quantize functions aren't removed it will be fine.
Anyway, I appreciate the enthusiasm but remember that there are already other ways to use pre-merged LoRA adapters with llama.cpp, it will get done but none of this is really time sensitive.
> This does work with quantized models.
What do you mean, it works with quantized models?
The lora is applied correctly and the outputs of the model match the expected lora outputs only when the model is f16 or f32. If you try to use a quantized model with a lora, the model will not respond as the lora-tuned model.