QA-LoRA: Quantization-Aware Low-Rank Adaptation
Hi there 👋
Today I came across this paper: QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models.
So what's wrong with the current QLoRA?
QLoRA quantizes the pretrained weights into NF4/FP4 format, but keeps the trainable LoRA weights (matrices A and B) in non-quantized form (float16/bfloat16). During each forward pass, in every layer, the pretrained weights are dequantized to the same dtype as the LoRA weights, so all calculations are still done in non-quantized form (see the sketch after this list), which results in:
- No speed improvement compared to LoRA; it's actually slower, since there is overhead from the quantize-dequantize round trip.
- If one wants to run the fine-tuned model for inference in quantized form, the accuracy will drop.
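For illustration, here is a rough sketch of what a QLoRA linear layer conceptually does on each forward pass. This is my own simplified pseudocode, not the actual bitsandbytes API; `dequantize` stands in for whatever routine unpacks the NF4 weight.

```python
import torch

def qlora_linear_forward(x, w_quantized, dequantize, lora_A, lora_B, scaling):
    """Rough sketch of a QLoRA linear forward pass (illustrative, not the real API).

    lora_A is (r, in_features), lora_B is (out_features, r), both in half precision.
    """
    # The 4-bit pretrained weight is dequantized back to the activation dtype
    # (e.g. bfloat16) on every call, which is pure overhead compared to plain LoRA ...
    w = dequantize(w_quantized).to(x.dtype)
    # ... and both matmuls still run in half precision, so there is no speedup either.
    return x @ w.T + (x @ lora_A.T) @ lora_B.T * scaling
```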
That's what the authors of the paper are trying to solve: train the model in fully quantized form, so that:
- It's faster: there is no need to dequantize the weights, and the weights are not kept in half precision, which should improve performance since LLMs are IO-bound. In addition, the INT4 dtype is used, and "INT4 operators have been optimized by CUDA and are much faster in execution".
- There is no significant drop in accuracy at inference, especially when lower-bit quantization is used.
Given that it's supposedly easy to implement and brings that much improvement during fine-tuning and inference, I think it will soon be all the rage, so it's worth spending time implementing. But, of course, it's all up to @carmocca to decide.
In addition, I've noticed that for GPTQ quantization the implementation repo uses AutoGPTQ, so that might be an answer to the question of which implementation to use.
If it's decided to implement this, I'd love to work on it: both AutoGPTQ and QA-LoRA. But it will be after I finish the other work, which might not be soon 😞.
Links:
- Paper: https://arxiv.org/pdf/2309.14717.pdf
- Implementation: https://github.com/yuhuixu1993/qa-lora
Just read the paper and came here to suggest it, only to see you were already faster 😊
What I currently don't understand is why they need to shrink the number of LoRA parameters. It's related to the quantization, I guess, but wouldn't it be possible to leave the LoRA adapter weights at the same size as the original LoRA adapter weights? I guess I don't know enough about quantization implementations ...
I think on paper this is a simple modification like you said, but that's only true if this step above is already handled automatically by quantization implementations.
What would the usage look like?
For QLoRA
python finetune/lora.py --quantize "bnb.nf4"
and for QA-LoRA
python finetune/lora.py --quantize "qa-bnb.nf4"
perhaps?
> Just read the paper and came here to suggest it, only to see you were already faster 😊
Well, what can I say ... 😊.
> I guess I don't know enough about quantization implementations ...
Neither do I. That's the main reason why I'd like to work on this task: to get more hands-on experience.
> What I currently don't understand is why they need to shrink the number of LoRA parameters. It's related to the quantization, I guess, but wouldn't it be possible to leave the LoRA adapter weights at the same size as the original LoRA adapter weights?
I'm not hugely confident that I understood the paper correctly, but from what I see, they decided to reduce the number of LoRA parameters to:
- not only fit the number of trainable parameters for quantization and LoRA (QA-LoRA) into the same budget as the previous implementation of LoRA (QLoRA),
- but also fix "the imbalanced degrees of freedom for quantization and adaptation", which, to my understanding, means that with a smaller number of LoRA parameters the model will focus more on the quantization parameters in order to minimize the loss value; some sort of bottleneck.
Though I might be wrong; I need to dive deeper into the paper/code ...
> I think on paper this is a simple modification like you said, but that's only true if this step above is already handled automatically by quantization implementations.
The main difference is that there is an nn.AvgPool1d that is applied to the input x.
Although in the paper I see that there is an additional scaling factor applied after the average pooling, which I don't see in the implementation.
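To make that concrete, here is a minimal sketch of how I read the adapter branch, assuming the pooling-then-LoRA structure described above. This is my own simplified code, not the actual qa-lora implementation, and the extra scaling factor is only hinted at in a comment since I'm not sure where (or whether) it belongs.

```python
import torch
import torch.nn as nn

class QALoRAAdapterSketch(nn.Module):
    """Rough sketch of a QA-LoRA adapter branch (illustrative, not the qa-lora code)."""

    def __init__(self, in_features, out_features, r, group_size, scaling=1.0):
        super().__init__()
        assert in_features % group_size == 0
        num_groups = in_features // group_size           # the number of quantization groups (L in the paper)
        self.pool = nn.AvgPool1d(group_size)             # averages x within each quantization group
        self.lora_A = nn.Linear(num_groups, r, bias=False)   # takes L inputs, not in_features
        self.lora_B = nn.Linear(r, out_features, bias=False)
        self.scaling = scaling                           # LoRA's usual alpha / r factor

    def forward(self, x):
        # x: (batch, in_features) -> pooled: (batch, num_groups)
        pooled = self.pool(x.unsqueeze(1)).squeeze(1)
        # The paper's extra factor would presumably turn the mean into a group-wise sum,
        # e.g. `pooled * group_size`; it is left out here to mirror what I see in the repo.
        return self.lora_B(self.lora_A(pooled)) * self.scaling
```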
> - but also fix "the imbalanced degrees of freedom for quantization and adaptation", which, to my understanding, means that with a smaller number of LoRA parameters the model will focus more on the quantization parameters in order to minimize the loss value; some sort of bottleneck.
> Though I might be wrong; I need to dive deeper into the paper/code ...
Yep, unsurprisingly, I was wrong 😊.
Everything is explained in Section 3.3 of the paper.
It's still not clear to me, though, why $c_{i,j}$ has to be constant. The phrase "This is intractable in continuous and gradient-based optimization" doesn't ring any bells for me. Perhaps I need deeper knowledge 🤔.
> wouldn't it be possible to leave the LoRA adapter weights at the same size as the original LoRA adapter weights?
Technically no, the size/shape cannot be the same. We have to reduce the number of parameters of the matrix lora_A so that the properties of quantization are kept and we can do the summation without an intermediate dequantization step.
To be more precise, we could keep the same shape, but then lora_A would have to be equal row-wise, which effectively makes its rank equal to 1 and thus hurts accuracy. The proposed solution is to reduce the rows of matrix lora_A to L groups: the pretrained weights are quantized in L groups per column, and now only the values inside each group have to match.
But if you want the number of learnable parameters to be close or equal to that in QLoRA, you can increase the low-rank adaptation value (the rank) or the number of groups (or both).
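As a sketch of why this per-group structure is what enables merging without dequantization (my own notation, not necessarily the paper's): write a quantized pretrained weight in group $(l, j)$ as $w \approx \alpha_{l,j}(\hat{w} - \beta_{l,j})$, with 4-bit code $\hat{w}$, scale $\alpha_{l,j}$ and zero point $\beta_{l,j}$. If the LoRA correction added to every weight of that group is the same constant $c_{l,j}$, then

$$
w + c_{l,j} \approx \alpha_{l,j}(\hat{w} - \beta_{l,j}) + c_{l,j}
= \alpha_{l,j}\left(\hat{w} - \left(\beta_{l,j} - \frac{c_{l,j}}{\alpha_{l,j}}\right)\right),
$$

so only the zero point changes and the 4-bit codes stay untouched, whereas an arbitrary (non-constant) correction would force re-quantizing the whole group.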
Notation:
- D_in, D_out, D_int: the input, output, and low-rank adaptation dimensions
- L: the number of quantization groups of the weights W (D_in // L is the group size)
With LoRA/QLoRA, matrix A is (D_in, D_int) and matrix B is (D_int, D_out).
With QA-LoRA, matrix A is (L, D_int) and matrix B is (D_int, D_out).
So you can increase either L or D_int. From the table you can see that with a larger L (smaller group size) the accuracy is higher.
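Here is a tiny numerical check of the shape argument, with hypothetical toy sizes: a full-size lora_A whose rows are constant within each group gives exactly the same result as average-pooling x over the groups and rescaling by the group size, which I suspect is the extra scaling factor mentioned earlier.

```python
import torch
import torch.nn.functional as F

D_in, D_int, L = 8, 2, 4       # toy sizes: input dim, LoRA rank, number of groups
group_size = D_in // L

# QA-LoRA's lora_A is (L, D_int); expand it to a full (D_in, D_int) matrix whose
# rows are identical within each quantization group
A_small = torch.randn(L, D_int)
A_full = A_small.repeat_interleave(group_size, dim=0)

x = torch.randn(3, D_in)

out_full = x @ A_full                                         # rank-constrained "full-size" adapter
pooled = F.avg_pool1d(x.unsqueeze(1), group_size).squeeze(1)  # (3, L) group means
out_pooled = (pooled * group_size) @ A_small                  # rescale means into group sums

torch.testing.assert_close(out_full, out_pooled)              # identical results
```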
I have another question regarding the paper. In which format are the adapters kept? Do we use NF4 for the adapters, or do we keep them in bfloat16 and just convert to NF4 when merging?