Adding PiSSA as an optional initialization method of LoRA
In our paper (https://arxiv.org/pdf/2404.02948.pdf), we introduce a parameter-efficient fine-tuning (PEFT) method, Principal Singular values and Singular vectors Adaptation (PiSSA), which optimizes a significantly reduced parameter space while achieving or surpassing the performance of full-parameter fine-tuning.
PiSSA is inspired by Intrinsic SAID, which suggests that pre-trained, over-parametrized models inhabit a space of low intrinsic dimension. Consequently, PiSSA represents a matrix $W\in\mathbb{R}^{m\times n}$ within the model by the product of two trainable matrices $A \in \mathbb{R}^{m\times r}$ and $B \in \mathbb{R}^{r\times n}$, where $r \ll \min(m, n)$, plus a residual matrix $W^{res}\in\mathbb{R}^{m\times n}$ for error correction. Singular value decomposition (SVD) is employed to factorize $W$, and the principal singular values and vectors of $W$ are used to initialize $A$ and $B$. The residual singular values and vectors initialize the residual matrix $W^{res}$, which remains frozen during fine-tuning. Notably, PiSSA shares the same architecture as Low-Rank Adaptation (LoRA), which hypothesizes that the change in model parameters $\Delta W$ forms a low-rank matrix. However, LoRA approximates $\Delta W$ through the product of two matrices, $A$, initialized with Gaussian noise, and $B$, initialized with zeros, while PiSSA initializes $A$ and $B$ with the principal singular values and singular vectors of the original matrix $W$. Since the principal singular values and vectors capture the essence of a low-rank matrix, PiSSA can better approximate the outcome of full-parameter fine-tuning from the start by updating the essential parts while freezing the "noisy" parts; LoRA, in contrast, freezes the original matrix and updates the "noise". This distinction enables PiSSA to converge much faster than LoRA and to achieve better final performance. On five common benchmarks, PiSSA outperforms LoRA on all of them using exactly the same setup except for the initialization. On GSM8K, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, outperforming LoRA's 67.7% by 5.16 percentage points.
Because it shares the same architecture, PiSSA inherits many of LoRA's advantages, such as parameter efficiency and compatibility with quantization. Leveraging a fast SVD method, the initialization of PiSSA takes only a few seconds, so switching from LoRA to PiSSA costs virtually nothing.
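As a concrete illustration of the initialization described above, here is a minimal sketch using plain `torch.linalg.svd` (the helper name and the square-root split of the singular values are illustrative assumptions, not the PEFT implementation, which relies on a fast SVD):

```python
import torch

def pissa_init(W: torch.Tensor, r: int):
    # Sketch of PiSSA-style initialization for a weight matrix W (m x n): the top-r
    # singular triplets initialize the trainable A and B, the rest form the frozen residual.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] @ torch.diag(S[:r].sqrt())            # m x r, trainable
    B = torch.diag(S[:r].sqrt()) @ Vh[:r, :]           # r x n, trainable
    W_res = U[:, r:] @ torch.diag(S[r:]) @ Vh[r:, :]   # m x n, frozen during fine-tuning
    return A, B, W_res

W = torch.randn(64, 32)
A, B, W_res = pissa_init(W, r=8)
# By construction the model is unchanged at initialization: W == W_res + A @ B.
assert torch.allclose(W, W_res + A @ B, atol=1e-4)
```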
Hi, I noticed that the PR submitted for your PiSSA work differs from the original paper in one place: https://github.com/huggingface/peft/blob/ec15cafd929bef508412848fc4e3bfdba46355d7/src/peft/tuners/lora/layer.py#L178. Why does the computation of lora A and lora B here not match the paper? In the paper the lora A matrix is Ur @ Sr, but in the code it becomes Sr @ Vr. Which of the two ways of computing lora A and B were the experiments in the paper based on?
Hi, this is because the weight matrix of a linear layer torch.nn.Linear(in_channel, out_channel) is actually transposed, i.e. the shape of W is really out_channel x in_channel. Normally, you would have to transpose W, perform the SVD and initialize A and B, and then transpose again before assigning the factors to the newly inserted linear layers. However, if you swap the order of Ur and Vhr, the transpose operations before and after the SVD can be avoided. The two computations are equivalent, but the latter is more efficient.
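To make this concrete, here is a small sketch of the equivalence, assuming plain `torch.linalg.svd` (the variable names are illustrative and do not mirror the PEFT internals):

```python
import torch

torch.manual_seed(0)
in_features, out_features, r = 16, 32, 4
# nn.Linear stores its weight as (out_features, in_features), i.e. already "transposed".
W = torch.nn.Linear(in_features, out_features, bias=False).weight.data

# Route 1: transpose first, decompose, then transpose the principal part back.
U, S, Vh = torch.linalg.svd(W.T, full_matrices=False)
principal_1 = (U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]).T

# Route 2: decompose the stored (transposed) weight directly; Ur and Vhr simply swap roles.
U2, S2, Vh2 = torch.linalg.svd(W, full_matrices=False)
principal_2 = U2[:, :r] @ torch.diag(S2[:r]) @ Vh2[:r, :]

# Both routes yield the same principal component, hence the same A/B initialization
# and the same residual W_res = W - principal.
assert torch.allclose(principal_1, principal_2, atol=1e-5)
```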
Let me know when this is ready for review. Also, please run `make style` on the code.
I've run `make style` on the code, following your advice, and believe it's now ready for review. Please let me know if there's anything else needed.
Hey @fxmeng, after some internal discussion, we had some concerns about this line:
https://github.com/huggingface/peft/pull/1626/files#diff-24a141c266b7b714ae8fcc470f31bc283f7b0f5a671bbf6d5f092741fc374104R194
The issue here is that the model base weights are modified when initializing with PiSSA. This can have side-effects for the user. For example, when they disable all adapters, they would normally expect the model output to be the same as the base model, but here it's not the case. Or when a user loads a PiSSA-LoRA adapter and another LoRA adapter, that other adapter will not work correctly because it was trained on the unmodified base weight.
It would be possible to add a lot of checks everywhere and raise errors if we detect that PiSSA is used and a user wants to disable the adapter or switch to another adapter. But that's very complicated and error prone, and at the end of the day also not very user friendly. What I wonder is: How much performance would we lose if we keep the base weights unmodified? If this works almost as well, maybe we can keep the base weights and not have to add all those complications. Did you run experiments to test that?
Hi @BenjaminBossan, that's a really good question. In fact, we can convert a trained PiSSA into LoRA without any loss in performance, which allows sharing the converted LoRA and enjoying the training-efficiency gains of PiSSA without the need for any special checks. We provide a function for this conversion in this code (https://github.com/fxmeng/peft/blob/c679a504d0fe581b0ea213f121f4918c875c8c43/examples/pissa_finetuning/convert_pissa_to_lora.py), along with a complete worked example. We are compiling more tips on using PiSSA into a document.
In fact, we can convert a trained PiSSA into LoRA without any loss in performance, which allows sharing the converted LoRA and enjoying the training-efficiency gains of PiSSA without the need for any special checks.
Oh nice, thanks, I think it would be great to integrate this functionality into PEFT. To be sure I understand: We first load the base model, then initialize the PEFT model with PiSSA turned on, then train the PiSSA-LoRA adapter, then we can convert it to a normal LoRA adapter and share it with others. When someone loads this converted PiSSA-LoRA adapter, it works like a normal LoRA adapter, so there is no need to adjust the base model weights. This means we can disable it, combine it with other LoRA adapters, etc. Is that right?
Regarding the linked script, can you explain this line (or refer to the part of the paper that explains it):
https://github.com/fxmeng/peft/blob/c679a504d0fe581b0ea213f121f4918c875c8c43/examples/pissa_finetuning/convert_pissa_to_lora.py#L26
We are compiling more tips on using PiSSA into a document.
Looking forward to this.
We have explained the line you mentioned at https://github.com/fxmeng/peft/blob/7fabf84375092cc9b2d870188953602a02b9d8db/examples/pissa_finetuning/convert_pissa_to_lora.py#L26. We will include detailed instructions for converting PiSSA to LoRA in our documentation and in the next draft of the paper. Additionally, we have fixed a bug and tested the combination of the converted LoRA with the base model to ensure its correctness.
@fxmeng Let me know once this is ready for another review.
Hi @BenjaminBossan, I have completed all the documentation following your suggestions, and have also provided example code for 4-bit training in https://github.com/fxmeng/peft/tree/7b8af8e53875164e60a7707fe10f07f21c1baf75/examples/pissa_finetuning. Notably, a new experiment demonstrates that initializing PiSSA in full precision and then quantizing the residual model reduces the quantization error by 19% compared to QLoRA.
The PiSSA and PiSSA_niter_[1, 4, 16] initializations, compared to the original model, yield errors less than 1e-6 for the same inputs: https://github.com/fxmeng/peft/blob/7b8af8e53875164e60a7707fe10f07f21c1baf75/tests/test_initialization.py#L254-L262
It is now ready for review.
Thank you for your time.
Thanks a lot for the updates. We're making good progress but there are still a couple of steps to take. Please check out my comments.
Conversion
Regarding the PiSSA->LoRA conversion: I wonder if we can better automate this. Right now, the user has to go through extra steps, as you described in the included example. This requires them to update their training and inference scripts in a few places and it's also error prone -- loading the base model with PiSSA or loading the updated base model with PiSSA converted to LoRA would give wrong results, right?
I wonder if it would be possible to automate this step completely. Let's say that the user initializes with PiSSA, then they do the training. When they call `peft_model.save_pretrained`, could we automatically detect if PiSSA is used, and then automatically do the conversion from PiSSA to LoRA? We have the updated base weights already in memory, so we should not have to go through loading the safetensors first. If we want to give the users more options, we could add a new config argument `LoraConfig.auto_convert_pissa=True/False` and only do this automatically if `True`. What do you think about this, is it possible or am I missing something?

Testing
I think we should add a few more tests to ensure that the new functionality is well covered. What I would like to see is a more complete test that consists of:
- Initialize a model with PiSSA
- Go through the conversion step PiSSA -> LoRA
- Load the converted LoRA and show that the results are still the same
- Load an additional normal LoRA and show that it works correctly with the PiSSA model
Moreover, let's add a test for PiSSA + bnb quantization. You mention that PiSSA reduces the quantization error, so this test could check for that. We already have tests along this line for LoftQ, so the new tests could be structured similarly.
Other
- Please run `make style`
- Please add the copyright notice to all new files
If you have questions, feel free to ask.
Thank you for your valuable advice.
Regarding the conversion from PiSSA to LoRA, it might not be possible to compute $\Delta W$ using only the residual model and the PiSSA modules during the training process. Therefore, it is necessary to save the initial PiSSA parameters and load them from local storage. How about saving the initial PiSSA and converting the fine-tuned PiSSA to LoRA in `peft_model.__init__` and `peft_model.save_pretrained`, respectively?
So we need not only the modified base weights + the trained PiSSA weights, but also the untrained PiSSA weights, do I get this right?
How about saving the initial PiSSA and converting the fine-tuned PiSSA to LoRA in `peft_model.__init__` and `peft_model.save_pretrained`, respectively?
I see the argument for updating `save_pretrained`, but I don't understand yet why we have to touch `__init__`. Anyway, if you have come up with an idea to simplify these steps, please go ahead and implement it and we can iterate based on that.
Yes, the modified base weights + the trained PiSSA weights result in the fine-tuned model, but the difference from the pre-trained model can only be calculated using the initial PiSSA parameters.
Okay, makes sense, thanks.
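To make the conversion concrete, here is a hand-rolled sketch of the idea (an illustration of the math, not the PR's implementation): the fine-tuned model is $W^{res} + A'B'$ and $W = W^{res} + A_0 B_0$, so the difference from the pre-trained weights is $\Delta W = A'B' - A_0 B_0$, which can be packed into a rank-$2r$ LoRA adapter built from the trained and the negated initial PiSSA factors. This is why the initial PiSSA weights must be saved, and it is consistent with the rank doubling from 8 to 16 checked in the test further down the thread.

```python
import torch

torch.manual_seed(0)
m, n, r = 32, 16, 4
W = torch.randn(m, n)

# Initial PiSSA factors (principal part of W) and the frozen residual.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A0 = U[:, :r] @ torch.diag(S[:r].sqrt())
B0 = torch.diag(S[:r].sqrt()) @ Vh[:r, :]
W_res = W - A0 @ B0

# Pretend fine-tuning updated the trainable factors.
A_ft = A0 + 0.01 * torch.randn_like(A0)
B_ft = B0 + 0.01 * torch.randn_like(B0)
finetuned = W_res + A_ft @ B_ft

# Conversion: express the same model as the *unmodified* base W plus a rank-2r LoRA,
# by stacking the trained factors with the negated initial factors.
A_lora = torch.cat([A_ft, A0], dim=1)   # m x 2r
B_lora = torch.cat([B_ft, -B0], dim=0)  # 2r x n
assert torch.allclose(finetuned, W + A_lora @ B_lora, atol=1e-4)
```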
Sorry, I was at a conference these past few days. Will review soon.
Thanks a lot for the recent changes, these should make the usage of PiSSA more comfortable for the user, good work.
I left a couple of comments, please check them out. Moreover, we should extend the testing a bit. Check out these tests, where we initialize LoRA with a few different options, like rslora. Maybe we can add a test for PiSSA there as well?
On top, let's add one test for the conversion part, as I described in the Testing section of my earlier comment. I saw that the test you added already uses `save_as_lora`, but as is, this test is only focused on the quantization error. We also need a test to ensure that after conversion from PiSSA to LoRA, the results using the original base model with the converted adapter are still the same.
Thank you for your valuable advice. We have revised the corresponding code according to your suggestions. Please review it and let us know if there are any further modifications needed.
Thanks for the updates, I don't think we're missing a lot at this point.
In this review, I did a deep dive into the conversion part so that I hopefully understand it better now. Right now, it is not covered by tests well enough. Therefore, I created the following test. Could you please check if it makes sense to you and if yes, please add it to the existing test. We should also have a similar test for bnb quantized weights.
```python
def test_lora_pissa_conversion_same_output_after_loading(self, data, tmp_path):
    model = self.get_model()
    output_base = model(data)[0]

    config = LoraConfig(init_lora_weights="pissa", target_modules=["linear"], r=8)
    peft_model = get_peft_model(deepcopy(model), config)
    # save the initial model
    peft_model.peft_config["default"].init_lora_weights = True
    peft_model.save_pretrained(tmp_path / "init-model")
    peft_model.peft_config["default"].init_lora_weights = "pissa"

    # modify the weights, or else the adapter performs an identity transformation
    peft_model.base_model.linear.lora_B["default"].weight.data *= 2.0
    output_pissa = peft_model(data)[0]

    # sanity check
    tol = 1e-06
    assert not torch.allclose(output_base, output_pissa, atol=tol, rtol=tol)

    # save the model normally
    peft_model.save_pretrained(tmp_path / "pissa-model")
    model_loaded = PeftModel.from_pretrained(deepcopy(model), tmp_path / "pissa-model")
    output_loaded = model_loaded(data)[0]

    assert torch.allclose(output_pissa, output_loaded, atol=tol, rtol=tol)
    # sanity check: ranks should still be 8 as initially
    assert model_loaded.peft_config["default"].r == 8
    assert model_loaded.base_model.model.linear.lora_A["default"].weight.shape[0] == 8
    # sanity check: the base model weights were indeed changed
    assert not torch.allclose(
        model.linear.weight, model_loaded.base_model.model.linear.base_layer.weight, atol=tol, rtol=tol
    )

    # save the model with conversion
    peft_model.save_pretrained(tmp_path / "pissa-model-converted", convert_pissa_to_lora=tmp_path / "init-model")
    model_converted = PeftModel.from_pretrained(deepcopy(model), tmp_path / "pissa-model-converted")
    output_converted = model_converted(data)[0]

    assert torch.allclose(output_pissa, output_converted, atol=tol, rtol=tol)
    # rank should be double of what it was initially
    assert model_converted.peft_config["default"].r == 16
    assert model_converted.base_model.model.linear.lora_A["default"].weight.shape[0] == 16
    # base model weights should be the same as the initial model
    assert torch.allclose(
        model.linear.weight, model_converted.base_model.model.linear.base_layer.weight, atol=tol, rtol=tol
    )
```
Apart from that, I only have a few smaller comments, please check them out.
We are excited to hear that the process is nearing completion. I have carefully reviewed the program you wrote concerning the transformation from PiSSA to LoRA and confirm that it makes perfect sense -- thank you for your effort. While attempting to apply it to a quantized model, I discovered that the quantization error of W_res is not equal to that of W, so in theory Quant(W_res) + AB != Quant(W) + \Delta(AB). Therefore, it is advisable to use the PiSSA adapter in conjunction with Quant(W_res) during quantized training.
Hey @fxmeng, have been following this PR to try it out for my own projects, thanks for the substantive work, have a few questions.
Does this mean that as long as I am not using quantization for fine-tuning and serving the model, it is alright for me to convert the PiSSA to LoRA, whereas in the case of QLoRA-like training, i.e. when using `BitsAndBytesConfig`, I should be using the residual base model instead of the vanilla base model for inference?
Did you run any evals to quantify the magnitude of the impact of this mismatch with quantized models?
Does this mean that as long as I am not using quantization for fine-tuning and serving the model, it is alright for me to convert the PiSSA to LoRA?
Yes, converting PiSSA to LoRA is theoretically equivalent and does not introduce additional losses.
In the case of QLoRA-like training, i.e. when using `BitsAndBytesConfig`, I should be using the residual base model instead of the vanilla base model for inference?
Yes, since quant(W_res) was used during the training phase, using it for inference introduces no errors, whereas combining the converted LoRA with the quantized base model does introduce errors.
Did you run any evals to quantify the magnitude of the impact of this mismatch with quantized models?
As you recommended, we conducted experiments using 4-bit PiSSA training on LLaMA-3-8B, comparing the inference performance on the GSM8K and MATH datasets before and after converting PiSSA to LoRA. As shown in the table below, the converted performance is acceptable. Therefore, we only raise a warning message when users wish to proceed in this manner.
| 4-bit Fine-tuning | GSM8K | MATH |
|---|---|---|
| pissa + Quant(W_res) | 74.98 | 24.48 |
| pissa_to_lora + Quant(W) | 74.45 | 24.5 |
These experiments will be extended to more models and tasks, and will be included in the next version of our paper.
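To see why the two rows above cannot match exactly, here is a toy illustration with a crude uniform quantizer standing in for NF4 (purely for intuition, not bitsandbytes): quantization is non-linear, so the error committed on W_res differs from the error committed on W, even though W_res + AB equals W exactly in full precision.

```python
import torch

def toy_quant(x: torch.Tensor, step: float = 0.05) -> torch.Tensor:
    # Crude uniform rounding quantizer, a stand-in for NF4/INT8 for illustration only.
    return torch.round(x / step) * step

torch.manual_seed(0)
m, n, r = 32, 16, 4
W = torch.randn(m, n)

# PiSSA split: principal part (A, B) plus residual, with W_res + A @ B == W exactly.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] @ torch.diag(S[:r].sqrt())
B = torch.diag(S[:r].sqrt()) @ Vh[:r, :]
W_res = W - A @ B

# After quantizing the frozen part, the two parameterizations no longer agree, because
# the quantization error of W_res is not the quantization error of W.
err_pissa = (toy_quant(W_res) + A @ B) - W
err_plain = toy_quant(W) - W
print(err_pissa.norm().item(), err_plain.norm().item())  # generally different values
```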
Hey @fxmeng,
Thanks for such a thoughtful, detailed and helpful answer. I am trying it out and seeing very encouraging initial results. Kudos for this fantastic work.
While trying out this PR, I found a minor bug in `preprocess.py` and have commented on it there; see if you agree.
Thank you for your valuable comments; we have now addressed the corresponding issues.
Thanks for the updates @fxmeng. Could you please also fix the merge conflict and let me know once this is ready for review?
Hey @BenjaminBossan,
I have added the test `test_lora_pissa_conversion_same_output_after_loading` and the `pytest.mark.xfail` marker. Additionally, I have resolved the conflicts, and we are now ready for review.
The test `test_t5_pissa_8bit[cuda]` is failing when I run it on my machine: AssertionError: assert tensor(0.0288, device='cuda:0', grad_fn=) < (tensor(0.0223, device='cuda:0', grad_fn=) / 1.03)
Can you reproduce that? As you can see, the margin is quite big here, and the same holds when I check the MAE. Any idea why we see such an increase in quantization error when using PiSSA on this specific model?
It is quite strange that I can pass the `make style` check locally. Moreover, directly running the following script yields reasonable MAE and MSE values. The `mse_quantized` I obtain is 0.0761, which is higher than 0.0223, whereas `mse_pissa` is approximately 0.0288.
16.18s call tests/test_gpu_examples.py::TestPiSSA::test_t5_pissa_8bit[cpu]
16.10s call tests/test_gpu_examples.py::TestPiSSA::test_t5_pissa_4bit[cpu]
13.44s call tests/test_gpu_examples.py::TestPiSSA::test_t5_pissa_4bit[cuda]
12.89s call tests/test_gpu_examples.py::TestPiSSA::test_t5_pissa_8bit[cuda]
8.85s call tests/test_gpu_examples.py::TestPiSSA::test_bloomz_pissa_4bit[cuda]
7.05s call tests/test_gpu_examples.py::TestPiSSA::test_bloomz_pissa_8bit[cpu]
6.81s call tests/test_gpu_examples.py::TestPiSSA::test_bloomz_pissa_4bit[cpu]
6.33s call tests/test_gpu_examples.py::TestPiSSA::test_bloomz_pissa_8bit[cuda]
```python
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForSeq2SeqLM, AutoModelForCausalLM
import gc
import torch
from peft import (
    LoraConfig,
    PeftModel,
    TaskType,
    get_peft_model,
)


class TestPiSSA:
    r"""
    Tests for PiSSA to ensure that it reduces the quantization error compared to normal LoRA quantization.
    """

    # The error factor indicates by how much the quantization error should be decreased when using PiSSA compared to
    # quantization without PiSSA. Thus 1.03 means that the error should be decreased by 3% at least. This is a very
    # conservative value to prevent flakiness, in practice most gains are > 1.5
    error_factor = 1.03

    def get_input(self, model_id, device):
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        inputs = tokenizer("All I want is", padding=True, return_tensors="pt")
        if device == "cuda":
            inputs = inputs.to("cuda")
        return inputs

    def get_base_model(self, model_id, device, **kwargs):
        cls = AutoModelForSeq2SeqLM if "t5" in str(model_id) else AutoModelForCausalLM
        model = cls.from_pretrained(model_id, **kwargs).eval()
        if device == "cuda":
            model = model.to("cuda")
        return model

    @torch.no_grad()
    def get_logits(self, model, inputs):
        if model.config.is_encoder_decoder:
            input_ids = inputs["input_ids"]
            return model(input_ids=input_ids, decoder_input_ids=input_ids).logits
        return model(**inputs).logits

    def get_errors(
        self,
        tmp_path,
        bits=4,
        device="cuda",
        model_id="hf-internal-testing/tiny-random-BloomForCausalLM",
    ):
        # Helper function that returns the quantization errors (MAE and MSE) when comparing the quantized LoRA model
        # to the base model, vs the PiSSA quantized model to the base model. We expect the PiSSA quantized model to
        # have less error than the normal LoRA quantized model. Since we compare logits, the observed error is
        # already somewhat dampened because of the softmax.
        torch.manual_seed(0)
        model = self.get_base_model(model_id, device)
        task_type = TaskType.SEQ_2_SEQ_LM if model.config.is_encoder_decoder else TaskType.CAUSAL_LM
        inputs = self.get_input(model_id, device)
        # the base logits are the reference, we try to match those as closely as possible
        logits_base = self.get_logits(model, inputs)
        # clean up
        del model
        gc.collect()
        torch.cuda.empty_cache()

        # logits from the normal quantized LoRA model
        target_modules = "all-linear" if task_type != TaskType.SEQ_2_SEQ_LM else ["o", "k", "wi", "q", "v"]
        lora_config = LoraConfig(task_type=task_type, target_modules=target_modules)
        kwargs = {}
        if bits == 4:
            kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        elif bits == 8:
            kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
        else:
            raise ValueError("bits must be 4 or 8")
        quantized_model = get_peft_model(
            self.get_base_model(model_id, device=None, **kwargs),
            lora_config,
        )
        torch.manual_seed(0)
        logits_quantized = self.get_logits(quantized_model, inputs)
        del quantized_model
        gc.collect()
        torch.cuda.empty_cache()

        # logits from quantized LoRA model using PiSSA
        lora_config = LoraConfig(
            task_type=task_type,
            init_lora_weights="pissa",
            target_modules=target_modules,
        )
        model = self.get_base_model(model_id, device)
        if device == "cuda":
            model = model.to("cuda")
        pissa_model = get_peft_model(model, lora_config)
        if device == "cuda":
            pissa_model = pissa_model.to("cuda")

        # save LoRA weights, they should be initialized such that they minimize the quantization error
        pissa_model.base_model.peft_config["default"].init_lora_weights = True
        pissa_model.save_pretrained(f"{tmp_path}/pissa_model")

        pissa_model = pissa_model.unload()
        pissa_model.save_pretrained(f"{tmp_path}/residual_model")

        del pissa_model
        gc.collect()
        torch.cuda.empty_cache()

        # now load quantized model and apply PiSSA-initialized weights on top
        base_model = self.get_base_model(
            f"{tmp_path}/residual_model",
            device=None,
            **kwargs,
            torch_dtype=torch.float32,
        )
        pissa_model = PeftModel.from_pretrained(base_model, f"{tmp_path}/pissa_model", is_trainable=True)
        # TODO sanity check: model is quantized
        torch.manual_seed(0)
        logits_pissa = self.get_logits(pissa_model, inputs)
        del pissa_model
        gc.collect()
        torch.cuda.empty_cache()

        mae_quantized = torch.abs(logits_base - logits_quantized).mean()
        mse_quantized = torch.pow(logits_base - logits_quantized, 2).mean()
        mae_pissa = torch.abs(logits_base - logits_pissa).mean()
        mse_pissa = torch.pow(logits_base - logits_pissa, 2).mean()
        return mae_quantized, mse_quantized, mae_pissa, mse_pissa


test = TestPiSSA()
print(test.get_errors(tmp_path='t5', bits=8, model_id="google/flan-t5-base"))
# output: (tensor(0.2376), tensor(0.0761), tensor(0.1447), tensor(0.0290))
print(test.get_errors(tmp_path='t5', bits=4, model_id="google/flan-t5-base"))
# output: (tensor(1.6247), tensor(3.5636), tensor(0.6988), tensor(0.7377))
print(test.get_errors(tmp_path='bloom', bits=8, model_id="hf-internal-testing/tiny-random-BloomForCausalLM"))
# output: (tensor(7.4336e-05), tensor(8.8446e-09), tensor(2.3870e-05), tensor(9.1838e-10))
print(test.get_errors(tmp_path='bloom', bits=4, model_id="hf-internal-testing/tiny-random-BloomForCausalLM"))
# output: (tensor(0.0004), tensor(2.2412e-07), tensor(0.0003), tensor(1.3218e-07))
```
The `xfail` marker that you added is in the wrong place (it's not doing anything). What I meant is the following: IIUC, the test `test_lora_pissa_conversion_same_output_after_loading` would fail if we use bitsandbytes quantization. So I would like to see a copy of that test, only that bnb is used. As this test will fail, the `xfail` decorator should be added to that test.
bnb.nn.Params4bit requires the use of CUDA, so where should I put test_lora_pissa_conversion_same_output_after_loading_with_quantization? In test_gpu_examples.py or in test_initialization.py, and add @pytest.mark.skipif(not torch.cuda.is_available(), reason="test requires a GPU")?
I have placed test_lora_pissa_conversion_same_output_after_loading_use_quantization in test_initialization.py. Could you please check if this implementation is appropriate?
The test `test_t5_pissa_8bit[cuda]` is failing when I run it on my machine: AssertionError: assert tensor(0.0288, device='cuda:0', grad_fn=) < (tensor(0.0223, device='cuda:0', grad_fn=) / 1.03)
Can you reproduce that? As you can see, the margin is quite big here, and the same holds when I check the MAE. Any idea why we see such an increase in quantization error when using PiSSA on this specific model?
Could you please run `make style` to make the CI pass?
Instead of perplexity, I used the nuclear norm of the error matrix between each layer of the quantized pissa model and the base model to evaluate the magnitude of the quantization error in my paper. This implementation is not affected by factors such as random seeds, and the error calculated for each model is a fixed value. If the test_t5_pissa_8bit test still cannot pass on your local machine, how do you feel about replacing this test with the one used in the paper?
When I run your test above, I get the same or very similar values, except for T5 + 8-bit:
(tensor(0.1253, device='cuda:0'), tensor(0.0223, device='cuda:0'), tensor(0.1440, device='cuda:0'), tensor(0.0288, device='cuda:0'))
(tensor(1.6214, device='cuda:0'), tensor(3.5510, device='cuda:0'), tensor(0.6988, device='cuda:0'), tensor(0.7377, device='cuda:0'))
(tensor(7.4336e-05, device='cuda:0'), tensor(8.8446e-09, device='cuda:0'), tensor(2.3471e-05, device='cuda:0'), tensor(8.9277e-10, device='cuda:0'))
(tensor(0.0004, device='cuda:0'), tensor(2.2412e-07, device='cuda:0'), tensor(0.0003, device='cuda:0'), tensor(1.3223e-07, device='cuda:0'))
Not sure why that is, perhaps it's best to just remove that specific test (in that case, add a comment that this combination may fail on some machines).
I used the nuclear norm of the error matrix between each layer of the quantized pissa model and the base model to evaluate the magnitude of the quantization error in my paper. This implementation is not affected by factors such as random seeds, and the error calculated for each model is a fixed value. If the test_t5_pissa_8bit test still cannot pass on your local machine, how do you feel about replacing this test with the one used in the paper?
I admit that calculating MAE/MSE of logits is a bit flawed as a measure, this was chosen more from a practical viewpoint. I don't know this measure that you proposed and would need to read a bit more, but if you think it's superior, feel free to use it instead. But as mentioned, it would also be fine to remove this one specific test.
It is quite strange that I can pass the make style test locally
Maybe it's the ruff version? The version that the CI uses is `ruff-0.2.2`. If this doesn't solve the issue for you, let me know and I can send you a patch.
bnb.nn.Params4bit requires the use of CUDA, so where should I put
Ah yes, good point, then it should go to `tests/test_gpu_examples.py`. You could add a comment that references the other test in `test_initialization.py` so that we know that the two belong together.
I have changed the method for measuring quantization errors from calculating the MAE/MSE of the logits (the practical viewpoint) to calculating the nuclear norm of all error matrices. This method yields a fixed error for each model and has passed the tests in my local environment.
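For reference, a minimal sketch of such a nuclear-norm measurement (an assumed illustration of the metric, not the actual test code): for every targeted layer, take the nuclear norm of the difference between the effective weight (dequantized, plus the adapter product if one is attached) and the original full-precision weight, then sum over layers.

```python
import torch

def total_quantization_error(original_weights: dict, effective_weights: dict) -> torch.Tensor:
    # Sum of per-layer nuclear norms (sums of singular values) of the error matrices.
    # Both dicts map layer names to 2D weight tensors; effective_weights holds the
    # dequantized weight plus the PiSSA/LoRA product when an adapter is attached.
    # Unlike comparing logits on a particular input, this depends only on the weights,
    # so the value is deterministic and unaffected by random seeds.
    total = torch.zeros(())
    for name, w_orig in original_weights.items():
        err = w_orig.float() - effective_weights[name].float()
        total = total + torch.linalg.matrix_norm(err, ord="nuc")
    return total
```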