llama3-from-scratch
Transformer layer ops slightly off in image: output of feedforward should add to unnormalized residual
In the image of layer ops, the final output is shown as (output of feedforward) + (normalized residual), but it should be (output of feedforward) + (residual). In the README this was correctly implemented as embedding_after_edit + output_after_feedforward. If we were to follow the image, however, we would have written embedding_after_edit_normalized + output_after_feedforward, which does not match the llama3 code.
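To make the difference concrete, here is a minimal runnable sketch of the FFN sub-layer using the README's variable names; nn.RMSNorm and nn.Linear are hypothetical stand-ins for the README's actual norm and SwiGLU feedforward, only the placement of the residual addition matters:

import torch
import torch.nn as nn

dim = 16
embedding_after_edit = torch.randn(8, dim)   # residual stream entering the FFN sub-layer

ffn_norm = nn.RMSNorm(dim)        # stand-in norm (nn.RMSNorm requires PyTorch >= 2.4)
feed_forward = nn.Linear(dim, dim)  # stand-in for the real feedforward

embedding_after_edit_normalized = ffn_norm(embedding_after_edit)
output_after_feedforward = feed_forward(embedding_after_edit_normalized)

# Correct (matches the README and the llama3 source): add to the unnormalized residual
layer_output = embedding_after_edit + output_after_feedforward

# What the image implies (not what llama3 does):
# layer_output = embedding_after_edit_normalized + output_after_feedforward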
Looking at the original llama3 code:
class TransformerBlock(nn.Module):
    # ...
    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward(self.ffn_norm(h))
        return out
we can see that h should not be normalized before being added to the output of the feedforward.
True. In this pre-norm implementation, the norm-then-add procedure in the FFN sub-layer should mirror the one in the self-attention sub-layer.
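A minimal self-contained sketch of that parallel structure is below; nn.Linear stands in for the real attention and FFN modules, since only where the norms and residual additions sit is relevant here:

import torch
import torch.nn as nn

class PreNormBlockSketch(nn.Module):
    """Illustrates the pre-norm residual pattern; not the real llama3 block."""
    def __init__(self, dim: int):
        super().__init__()
        self.attention_norm = nn.RMSNorm(dim)    # PyTorch >= 2.4
        self.ffn_norm = nn.RMSNorm(dim)
        self.attention = nn.Linear(dim, dim)     # placeholder for self-attention
        self.feed_forward = nn.Linear(dim, dim)  # placeholder for the SwiGLU FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: normalize, attend, add back to the unnormalized residual x
        h = x + self.attention(self.attention_norm(x))
        # Sub-layer 2: normalize, feed forward, add back to the unnormalized residual h
        return h + self.feed_forward(self.ffn_norm(h))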