llama3-from-scratch
Transformer layer ops slightly off in image: output of feedforward should add to unnormalized residual
In the image of layer ops, the final output is shown as (output of feedforward) + (normalized residual), but it should be (output of feedforward) + (residual). In the README this was correctly implemented as embedding_after_edit + output_after_feedforward. If we were to follow the image, however, we would have written embedding_after_edit_normalized + output_after_feedforward, which does not match the llama3 code.
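To make the difference concrete, here is a minimal runnable sketch of the FFN sub-layer using the README's variable names; nn.RMSNorm and nn.Linear are hypothetical stand-ins for the README's actual norm and SwiGLU feedforward, only the placement of the residual addition matters:

import torch
import torch.nn as nn

dim = 16
embedding_after_edit = torch.randn(8, dim)   # residual stream entering the FFN sub-layer

ffn_norm = nn.RMSNorm(dim)        # stand-in norm (nn.RMSNorm requires PyTorch >= 2.4)
feed_forward = nn.Linear(dim, dim)  # stand-in for the real feedforward

embedding_after_edit_normalized = ffn_norm(embedding_after_edit)
output_after_feedforward = feed_forward(embedding_after_edit_normalized)

# Correct (matches the README and the llama3 source): add to the unnormalized residual
layer_output = embedding_after_edit + output_after_feedforward

# What the image implies (not what llama3 does):
# layer_output = embedding_after_edit_normalized + output_after_feedforward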
Looking at the original llama3 code:
class TransformerBlock(nn.Module):
    # ...
    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward(self.ffn_norm(h))
        return out
we can see that h should not be normalized before being added to the output of the feedforward.
True. In this pre-norm implementation, the norm-then-add procedure in the FFN sub-layer should mirror the one in the self-attention sub-layer.
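A minimal self-contained sketch of that parallel structure is below; nn.Linear stands in for the real attention and FFN modules, since only where the norms and residual additions sit is relevant here:

import torch
import torch.nn as nn

class PreNormBlockSketch(nn.Module):
    """Illustrates the pre-norm residual pattern; not the real llama3 block."""
    def __init__(self, dim: int):
        super().__init__()
        self.attention_norm = nn.RMSNorm(dim)    # PyTorch >= 2.4
        self.ffn_norm = nn.RMSNorm(dim)
        self.attention = nn.Linear(dim, dim)     # placeholder for self-attention
        self.feed_forward = nn.Linear(dim, dim)  # placeholder for the SwiGLU FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: normalize, attend, add back to the unnormalized residual x
        h = x + self.attention(self.attention_norm(x))
        # Sub-layer 2: normalize, feed forward, add back to the unnormalized residual h
        return h + self.feed_forward(self.ffn_norm(h))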