
A/B matrix initialization in layers.py does not conform to the description in the paper

Open jinxin-zhu opened this issue 11 months ago • 4 comments

"We use a random Gaussian initialization for A and zero for B,” in paper but: ` def reset_parameters(self):

    nn.Embedding.reset_parameters(self)

    if hasattr(self, 'lora_A'):

        # initialize A the same way as the default for nn.Linear and B to zero

        nn.init.zeros_(self.lora_A)

        nn.init.normal_(self.lora_B)

` in layers.py

jinxin-zhu avatar Jul 10 '23 03:07 jinxin-zhu

Hi Jinxin,

We didn't apply LoRA to embedding layers in the paper. In any case, it shouldn't make a meaningful difference which of A or B is initialized to zero, as long as the other one is not zero. Let me know if you see a substantial difference, though!
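For concreteness, here is a minimal sketch (the shapes and rank are made up, and the alpha/r scaling is omitted) showing that either ordering leaves the initial update B @ A at zero:

```python
import torch
import torch.nn as nn

d_out, d_in, r = 8, 16, 4  # hypothetical layer sizes and LoRA rank

# Init described in the paper: A ~ Gaussian, B = 0
A_paper = nn.init.normal_(torch.empty(r, d_in))
B_paper = torch.zeros(d_out, r)

# Init in the embedding branch of layers.py: A = 0, B ~ Gaussian
A_code = torch.zeros(r, d_in)
B_code = nn.init.normal_(torch.empty(d_out, r))

# Either way, the initial update Delta W = B @ A is the zero matrix,
# so the adapted layer starts out identical to the pretrained one.
assert torch.equal(B_paper @ A_paper, torch.zeros(d_out, d_in))
assert torch.equal(B_code @ A_code, torch.zeros(d_out, d_in))
```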


edwardjhu avatar Jul 10 '23 12:07 edwardjhu

@edwardjhu can you please tell us why at least one of A or B has to be non-zero?

aliasvishnu avatar Jul 21 '23 12:07 aliasvishnu

> @edwardjhu can you please tell us why at least one of A or B has to be non-zero?

Maybe that's what the paper means: ensure that at the beginning of training, the product of LoRA's A and B is zero, so the update starts from zero. Perhaps it's to make training start out stable.

haiduo avatar Jan 05 '24 12:01 haiduo

We want at least one of the matrices to be zero so that LoRA is a no-op on the first forward pass, which indeed stabilizes training. Say we are generating some content with a LM. If both matrices were non-zero, the random LoRA init, if large enough, might move the entire model so far from the original that we start generating garbage, which is bad for training.
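As a rough illustration (simplified shapes, no alpha/r scaling, and not the repo's actual forward code):

```python
import torch

d_out, d_in, r = 8, 16, 4       # hypothetical sizes
W0 = torch.randn(d_out, d_in)   # frozen pretrained weight
x = torch.randn(d_in)

A = torch.randn(r, d_in)        # one factor random...
B = torch.zeros(d_out, r)       # ...the other zero

base_out = W0 @ x
lora_out = W0 @ x + B @ (A @ x)  # LoRA-style forward pass

# Because B @ A == 0 at init, the adapted model reproduces the pretrained
# model exactly on the first forward pass; generation is unchanged until
# the optimizer starts updating A and B.
assert torch.allclose(base_out, lora_out)
```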

edwardjhu avatar Jan 05 '24 17:01 edwardjhu