Discussion about LISA
In the article, only a comparison of the average weight norm of each layer during LoRA fine-tuning is given.
- But what if the layers' weights already differ before fine-tuning?
- If a weighted averaging method were used instead, could differences appear during the intermediate iterations?
Thanks for your interest in LMFlow and LISA!
Regarding the first question, we conducted the fine-tuning with the same seed and the same base model, so the initial weights are identical before fine-tuning.
As for the second question, weighted averaging methods are certainly different from the normal training process. But since they are rarely adopted in practice when fine-tuning LLMs, we didn't conduct experiments on them. To draw insights from weighted averaging methods, we think at least two experiments are needed if anyone is interested in this direction:
- Verify that the weighted averaging method can match the performance of normal fine-tuning techniques
- Then check the layer-wise weight norms of the weighted-average-fine-tuned models
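For the second experiment, the layer-wise check is straightforward to script. Below is a minimal sketch of how one might compute per-layer weight norms; it uses a toy MLP as a stand-in, since the exact model and checkpoint format are not specified in this thread.

```python
# Sketch: compute the L2 norm of each parameter tensor, keyed by name.
# A toy MLP stands in for a weighted-average-fine-tuned model; for a real
# LLM you would load the checkpoint and pass it to the same function.
import torch
import torch.nn as nn

def layerwise_weight_norms(model: nn.Module) -> dict:
    """Return the L2 norm of every parameter tensor in the model."""
    return {name: p.detach().norm().item() for name, p in model.named_parameters()}

toy_model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
)
norms = layerwise_weight_norms(toy_model)
for name, n in norms.items():
    print(f"{name}: {n:.4f}")
```

Comparing these norms across layers (e.g. embedding vs. middle vs. output layers) is what the LISA paper does for LoRA-fine-tuned models, so the same plot would reveal whether weighted averaging produces a similar skew.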
Hope this information can be helpful 😄
@research4pan I have a question about the comparison between LISA fine-tuning and LoRA fine-tuning. LoRA can be combined with quantization to fine-tune larger models, i.e. QLoRA, and in practical tests the performance loss is acceptable even at 2-bit quantization. For example, a 32B model at 2 bits performs much better than a 14B model at 4 bits, and the 14B 4-bit model retains roughly 98% of the ability of the full-precision 14B model. LISA, on the other hand, only supports full-parameter fine-tuning.
So can LISA fine-tuning be combined with quantization technology?
Because fine-tuning a 14B or 32B model within 24GB of GPU memory is almost impossible without quantization. And if the model is to be actually put into use, max_model_len basically cannot be less than 4096*4.
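For context on why the question arises: LISA's core mechanism is to freeze all transformer layers and periodically unfreeze a small random subset for full-parameter updates. Below is a minimal sketch of that sampling step using a toy model; the function name `lisa_activate` and the toy architecture are illustrative, not LMFlow's actual API, and a real setup would target e.g. the decoder layers of a Hugging Face model.

```python
# Sketch of LISA-style layerwise sampling: freeze every layer, then
# randomly unfreeze n_active layers. In LISA this re-sampling happens
# every K optimizer steps; only the selected layers receive gradients.
import random
import torch.nn as nn

class ToyLM(nn.Module):
    """Toy stand-in for a stack of transformer decoder layers."""
    def __init__(self, n_layers: int = 8, dim: int = 16):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))

def lisa_activate(layers: nn.ModuleList, n_active: int, rng: random.Random) -> list:
    """Freeze all layers, then unfreeze a random subset of n_active layers."""
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    active = rng.sample(range(len(layers)), n_active)
    for i in active:
        for p in layers[i].parameters():
            p.requires_grad = True
    return active

model = ToyLM()
rng = random.Random(0)
active = lisa_activate(model.layers, n_active=2, rng=rng)
```

Because the frozen layers still need to be held in memory for the forward pass, whether they could be stored quantized (as in QLoRA) while the few active layers stay in full precision is exactly the open question raised above.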