LoRA

why use alpha/r instead of alpha?

Open — dingguo1996 opened this issue 1 year ago · 2 comments

The paper says we scale the LoRA update BAx by alpha/r. But why use alpha/r instead of just alpha?

[screenshot of the relevant paragraph from the paper]
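For concreteness, the scaling being asked about is the alpha/r factor applied to the low-rank update in the forward pass. A minimal PyTorch sketch of a LoRA linear layer might look like the following (illustrative only, with made-up names; this is not the reference loralib implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA linear layer -- an illustrative sketch, not loralib."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pretrained weight W (random here just so the sketch runs).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank factors: A starts small and random, B starts at zero,
        # so B @ A == 0 and the adapter is a no-op before training.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        # The scaling in question: alpha / r rather than alpha alone.
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (alpha / r) * B A x
        base = x @ self.weight.T
        lora = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * lora

# Quick shape check.
layer = LoRALinear(in_features=32, out_features=64, r=8, alpha=16)
print(layer(torch.randn(4, 32)).shape)  # torch.Size([4, 64])
```

With this factor, changing r automatically rescales the adapter's contribution, which is the behavior the question is about.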

dingguo1996 · Jul 20 '23 08:07

The magnitude of the preactivation after B is \Theta(r) after training with adaptive optimizers. Dividing by r stabilizes it and makes HP tuning easier as mentioned at the end of the paragraph.
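One rough way to unpack the \Theta(r) claim, written out as a heuristic sketch rather than a formal argument: decompose the preactivation after B into rank-one terms.

```latex
% Heuristic sketch (not a formal proof): decompose the LoRA preactivation.
\[
  B A x \;=\; \sum_{i=1}^{r} \left(a_i^{\top} x\right) b_i ,
\]
% where $a_i^{\top}$ is the $i$-th row of $A$ and $b_i$ the $i$-th column of $B$.
% With an adaptive optimizer such as Adam, each coordinate of $A$ and $B$
% receives steps of roughly the same size regardless of $r$, so each
% rank-one term $(a_i^{\top} x)\, b_i$ ends up with magnitude $\Theta(1)$
% after training, and the sum of $r$ such terms is $\Theta(r)$.
% The $\alpha/r$ factor cancels that growth:
\[
  \frac{\alpha}{r}\, B A x \;=\; \Theta(\alpha),
\]
% so the scaled update is governed by $\alpha$ alone, independent of $r$,
% which is why $\alpha$ can be tuned once and reused when sweeping over $r$.
```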

edwardjhu · Aug 05 '23 17:08

> The magnitude of the preactivation after B is \Theta(r) after training with adaptive optimizers. Dividing by r stabilizes it and makes HP tuning easier as mentioned at the end of the paragraph.

Thanks! But I'd like to know more about why "the magnitude of the preactivation after B is \Theta(r)". Could you share an explanation? @edwardjhu

chrisway613 · Aug 18 '23 10:08