LoRA
Why use alpha/r instead of alpha?
The paper says we need to scale the LoRA update by alpha/r. But why use alpha/r instead of just alpha?
The magnitude of the preactivation after B is \Theta(r) after training with adaptive optimizers. Dividing by r stabilizes it and makes hyperparameter tuning easier, as mentioned at the end of the paragraph.
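For reference, here is a minimal sketch of where the alpha/r factor enters a LoRA layer. This is not the repo's loralib implementation; the class name `LoRALinear` and the initialization details are illustrative only.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal illustrative LoRA layer: frozen weight plus scaled low-rank update."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # frozen pretrained weight (random here for illustration)
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # low-rank factors: A initialized small, B at zero so the update starts at zero
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        # the alpha/r factor in question
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path plus the low-rank update B(Ax), scaled by alpha/r so that
        # the preactivation magnitude stays comparable as r changes
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```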
Thanks! But I'd like to understand more about why "the magnitude of the preactivation after B is \Theta(r)". Could you share an explanation? @edwardjhu