
Support for Loading a Subset of Tensors for LoRA Models

Open skeskinen opened this issue 1 year ago • 6 comments

Firstly, thank you for the awesome project. I'm new to LLMs so I hope this suggestion makes sense.

LoRA is a technique for reducing the number of trainable parameters during fine-tuning, and it is really taking off with the recent Alpaca work. In LoRA models, typically only the attention weight matrices Wq and Wv are fine-tuned.
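For anyone unfamiliar, here is a minimal NumPy sketch of the idea (the shapes, rank, and scaling factor are illustrative, not taken from any particular implementation): instead of updating a full weight matrix, LoRA trains a low-rank pair of matrices whose product is added onto the frozen base weight.

```python
import numpy as np

d, r = 4096, 8       # hidden size of a 7B-class model and LoRA rank (illustrative)
alpha = 16           # LoRA scaling factor (illustrative)

W = np.random.randn(d, d).astype(np.float32)   # frozen base weight, e.g. Wq or Wv
A = np.random.randn(r, d).astype(np.float32)   # trainable low-rank factor, r x d
B = np.zeros((d, r), dtype=np.float32)         # trainable low-rank factor, d x r (zero-init)

# Effective weight after fine-tuning; only A and B are new and need to be shipped.
W_merged = W + (alpha / r) * (B @ A)

# Trainable parameters per adapted matrix: 2*d*r instead of d*d.
print(2 * d * r, "vs", d * d)   # prints: 65536 vs 16777216
```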

For projects shipping multiple LoRA fine-tuned models, most of the tensors remain unchanged by fine-tuning. Storing the full set of weights for each variant wastes a significant amount of storage (e.g., ~3.5 GB per fine-tune for a 7B model, multiplied by the number of tasks or personalities you want to ship). Supporting the loading of a subset of tensors for LoRA models would let llama.cpp store and load these models efficiently, reducing storage requirements and possibly the memory footprint if you want to keep multiple models in memory at the same time.

I propose to extend llama.cpp's functionality by adding support for loading a subset of tensors from separate .bin files. That way, all the work of merging the LoRA weights would still be done in Python, and the subset .bin files could also be quantized as usual.
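To make the proposal concrete, here is a rough sketch of the Python side (the function, tensor-naming convention, and file names are hypothetical, not an existing script): merge the LoRA deltas into the few tensors they touch and save only those tensors, so the usual conversion/quantization step can then produce a small subset .bin.

```python
import torch

def merge_lora_subset(base_path: str, lora_path: str, out_path: str, scale: float = 2.0):
    """Merge LoRA A/B pairs into the base weights they modify and save only those tensors.

    Hypothetical layout: lora_path contains tensors named '<target>.lora_A' and
    '<target>.lora_B' for each fine-tuned weight (e.g. the Wq/Wv projections).
    """
    base = torch.load(base_path, map_location="cpu")
    lora = torch.load(lora_path, map_location="cpu")

    subset = {}
    for name, a in lora.items():
        if not name.endswith(".lora_A"):
            continue
        target = name[: -len(".lora_A")]        # e.g. 'layers.0.attention.wq.weight'
        b = lora[target + ".lora_B"]
        # W' = W + scale * (B @ A); only this merged tensor ends up in the subset file.
        subset[target] = (base[target].float() + scale * (b.float() @ a.float())).half()

    torch.save(subset, out_path)

# File names are illustrative.
merge_lora_subset("consolidated.00.pth", "lora_weights.pt", "wq_wv_subset.pt")
```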

An alternative could be to natively support LoRA in llama.cpp. However, this approach would likely not be compatible with pre-quantization of the weights, since the LoRA deltas would have to be applied on top of already-quantized base tensors (afaict).

skeskinen avatar Mar 22 '23 16:03 skeskinen

Thank you for the useful summary of LoRA - I wasn't familiar with it and was wondering what it actually means. The proposed functionality sounds like something that can be achieved relatively easily in the existing framework.

Just curious - is this functionality currently available in other frameworks? Loading multiple personalities of the model in-memory with reduced storage and dynamically switching between them.

ggerganov avatar Mar 22 '23 16:03 ggerganov

@ggerganov LoRAs are used a lot in Stable Diffusion, and in the webui version of LLaMA as well: https://github.com/oobabooga/text-generation-webui/issues/332 (it doesn't work with 4-bit for them at the moment, though)

BadisG avatar Mar 22 '23 19:03 BadisG

> Loading multiple personalities of the model in-memory with reduced storage and dynamically switching between them.

With Stable Diffusion, loading LoRAs separately from models is very popular - there's a whole ecosystem of LoRAs distributed on sites like Civitai. Many people end up with dozens or hundreds of LoRAs, which is much more practical than keeping dozens of 4 GB+ models. That will be even more true for LLaMA, given its larger size.

I expect this to be popular for LLaMA as well once the process for fine-tuning models gets to be more accessible.

bakkot avatar Mar 23 '23 04:03 bakkot

See related technique: https://github.com/ggerganov/llama.cpp/issues/528

redthing1 avatar Mar 26 '23 14:03 redthing1

There are already related discussions and attempts here: https://github.com/ggerganov/llama.cpp/issues/172

and an implementation (using the original LLaMA checkpoints) here: https://github.com/tloen/alpaca-lora#inference-generatepy

If LoRA can be made to work with q4, it would be an awesome feature for both text generation and chat, very much like LoRAs for images in Stable Diffusion.

edwios avatar Mar 29 '23 12:03 edwios

That discussion is somewhat orthogonal to this feature request. alpaca-lora has a script for merging the LoRA weights and converting back to the PyTorch format, and the result can be used with llama.cpp as usual. That already works today.
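For reference, that merge step generally looks something like this with the Hugging Face transformers and peft libraries (a sketch only; the model/adapter IDs are examples, and the actual alpaca-lora export script may differ in its details):

```python
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the base model and apply the LoRA adapter on top of it.
base = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "tloen/alpaca-lora-7b")

# Fold the LoRA deltas back into the base weights and save a plain checkpoint,
# which can then be converted to ggml and quantized for llama.cpp as usual.
merged = model.merge_and_unload()
merged.save_pretrained("./alpaca-merged")
```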

skeskinen avatar Mar 29 '23 18:03 skeskinen