[Bug Report] hook_normalized is inconsistent between RMSNorm and LayerNorm

Open neelnanda-io opened this issue 1 year ago • 1 comment

In layer_norm.py, hook_normalized is applied after the gain and bias weights; in rms_norm.py it is applied before. This is inconsistent and highly confusing IMO, and should be fixed.

RMSNorm (rms_norm.py):

x = self.hook_normalized(x / scale).to(self.cfg.dtype)  # [batch, pos, length]
return x * self.w

LayerNorm (layer_norm.py):

return self.hook_normalized(x * self.w + self.b).to(self.cfg.dtype)
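
To make the asymmetry concrete, here is a minimal standalone sketch (my own paraphrase of the two forward passes, not the library source) of what a hook attached at hook_normalized sees in each case:

```python
import torch

def rms_norm_path(x, w, eps=1e-5, hook=lambda t: t):
    # Mirrors rms_norm.py: the hook fires BEFORE the gain w is applied.
    scale = (x.pow(2).mean(-1, keepdim=True) + eps).sqrt()
    normalized = hook(x / scale)            # hook_normalized sees x / scale only
    return normalized * w

def layer_norm_path(x, w, b, eps=1e-5, hook=lambda t: t):
    # Mirrors layer_norm.py: the hook fires AFTER the gain w and bias b.
    x = x - x.mean(-1, keepdim=True)
    scale = (x.pow(2).mean(-1, keepdim=True) + eps).sqrt()
    return hook(x / scale * w + b)          # hook_normalized sees the full output

x = torch.randn(2, 5, 8)
w, b = torch.randn(8), torch.randn(8)
seen = {}
rms_norm_path(x, w, hook=lambda t: seen.setdefault("rms", t))
layer_norm_path(x, w, b, hook=lambda t: seen.setdefault("ln", t))
# seen["rms"] excludes the gain; seen["ln"] includes the gain and bias. Any hook
# that caches or edits hook_normalized therefore behaves differently across the
# two norm types.
```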

This is an irritating situation. I think it's super bad to be inconsistent, since e.g. hook-based code won't transfer from an RMSNorm model to a LayerNorm model, and there will be silent errors. However, making it consistent would be (technically) a breaking change, though I'd guess it's not widely used behaviour?

I personally think hook_normalized should be changed to fire before the gain and bias weights in LN (sketched below), since that's what "normalized" intuitively means. This is what it originally meant; I then changed it about two years ago in the early days of the library, I think because that was then "the thing that is input into the next layer". But now we have attn_input and mlp_input hooks, so who cares.
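
In fragment form, mirroring the snippets above, the proposed LN ordering would look roughly like this (a sketch of the proposal only, not a patch; the surrounding forward code is omitted):

```python
# Sketch: move hook_normalized before the gain and bias, matching RMSNorm
x = self.hook_normalized(x / scale).to(self.cfg.dtype)  # [batch, pos, length]
return x * self.w + self.b
```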

Note that this is not an issue if you fold the LN weights, since the two placements are then equivalent.
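
For reference, folding happens at load time; a minimal usage sketch, assuming the current transformer_lens package name and the fold_ln argument to from_pretrained:

```python
from transformer_lens import HookedTransformer

# fold_ln=True (the default) folds the norm's gain and bias into the following
# layer's weights, so hook_normalized is the last thing the norm computes and the
# two hook placements coincide.
model = HookedTransformer.from_pretrained("gpt2", fold_ln=True)
```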

cc @ArthurConmy @bryce13950

neelnanda-io · Oct 06 '24 17:10

3.0 is coming sooner rather than later. We can definitely work this into that release.

bryce13950 · Oct 15 '24 22:10