
Apply LoRA to more Linear layers

Open · carmocca opened this issue 2 years ago • 6 comments

Our current LoRA implementation applies it only to the qv computation. However, recent results suggest there are performance improvements to be gained from applying it elsewhere.

For instance, the QLoRA paper reports:

As shown in Figure 2 for LLaMA 7B finetuning on Alpaca, we find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers are required to match full finetuning performance

I've seen other practitioners online also apply it to the lm_head and the MLP, but I don't have any sources to cite on whether that's better or worse.
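For context, this is roughly what a LoRA-adapted linear looks like (a minimal sketch, not the actual lit-llama code; the class name, initialization, and defaults are illustrative):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        for p in self.linear.parameters():
            p.requires_grad = False  # the pretrained weight and bias stay frozen
        # Low-rank factors: the effective weight becomes W + (alpha / r) * B @ A
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init -> no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Today only the q and v parts of the attention projection are adapted this way; the open question is whether the attention output projection, the MLP linears, and lm_head should be too.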

carmocca · May 31 '23 22:05

In Section 7.1 of the LoRA paper, the authors compared applying LoRA to fewer layers with a higher rank versus more layers with a smaller rank, and found that the larger number of layers wins despite the smaller rank. That of course doesn't necessarily mean that, all else being equal, more LoRA layers are always better, but it's the best evidence I could think of.

Andrei-Aksionov · Jun 01 '23 13:06

Hello @carmocca

I can help with that. Well, sorta. I don't have even a single GPU, so I can write code that supports different configurations and check that everything works (with some small model that runs on my laptop), and then someone from your team with access to servers can run it and check the results.

I am thinking about passing a string to the lora context manager, something like qkvpmh, where:

  • q: query
  • k: key
  • v: value
  • p: projection
  • m: MLP
  • h: head

so if a key is present, LoRA will be applied to the corresponding weights (rough sketch below).
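Just to show the shape of the API I have in mind (function and variable names are placeholders, not the final code):

```python
# Placeholder sketch: each character of the flag string enables LoRA for one weight group.
_LORA_KEYS = {
    "q": "query",
    "k": "key",
    "v": "value",
    "p": "projection",
    "m": "mlp",
    "h": "head",
}


def parse_lora_layers(flags: str) -> set:
    """Translate e.g. "qv" (current behaviour) or "qkvpmh" (everything) into layer names."""
    unknown = set(flags) - set(_LORA_KEYS)
    if unknown:
        raise ValueError(f"Unknown LoRA flags: {sorted(unknown)}")
    return {_LORA_KEYS[c] for c in flags}


print(sorted(parse_lora_layers("qkvpmh")))  # ['head', 'key', 'mlp', 'projection', 'query', 'value']
```

The lora context manager would then only wrap the linears whose group appears in this set.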


Does that work for you? Or is it easier for you to do it on your own rather than spending time on coordination and fixing mistakes?

Andrei-Aksionov · Jul 03 '23 14:07

@Andrei-Aksionov Feel free to start this work! We won't have time to work on this for now.

You might want to work on the lit-gpt repository instead, which also has a LoRA implementation: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/lora.py

For the implementation, I would be more explicit and reference the actual linear attribute names instead of a minified qkvpmh mapping to the different layers. I suggest you find the most straightforward solution that works for now. The API can always grow later as we learn of new limitations or requirements.
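For illustration only (these field names are hypothetical, not an existing lit-gpt API), something in this direction reads better to me than decoding a flag string:

```python
from dataclasses import dataclass


@dataclass
class LoRAArgs:
    """Hypothetical explicit per-layer switches instead of a "qkvpmh" string."""

    r: int = 8
    alpha: int = 16
    dropout: float = 0.0
    to_query: bool = True
    to_value: bool = True
    to_key: bool = False
    to_projection: bool = False
    to_mlp: bool = False
    to_head: bool = False
```

Anyone reading the finetuning script can then see at a glance which linears get adapters.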

carmocca · Jul 03 '23 14:07

You might want to work on the lit-gpt repository instead

Why is that? I have nothing against it, just curious.

For the implementation, I would be more explicit,

Sure, that makes sense.

Andrei-Aksionov · Jul 03 '23 14:07

We are focusing more on that project moving forward. It includes support for GPT-NeoX-derived and LLaMA-derived weights.

carmocca · Jul 03 '23 14:07

Understood. Well, then we'll meet there :)

Andrei-Aksionov · Jul 03 '23 14:07