machine-learning-book
Batch vs layer normalization description (page 560)
Hi Sebastian,
There is a description of batch and layer normalization (including the figure) on page 560:
"While layer normalization is traditionally performed across all elements in a given feature for each feature independently, the layer normalization used in transformers extends this concept and computes the normalization statistics across all feature values independently for each training example."
Is layer normalization described correctly in the first case? It seems that when we compute the statistics for each feature independently, we are actually performing batch normalization?
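To check my understanding, here is a minimal PyTorch sketch (my own, not from the book) of where the statistics are computed in each case:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)  # (batch_size=4, num_features=8)

# Batch norm: one mean/variance per feature, computed across the batch dimension
bn = nn.BatchNorm1d(num_features=8, affine=False)
x_bn = bn(x)
print(torch.allclose(
    x_bn,
    (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + bn.eps),
    atol=1e-6))  # True

# Layer norm: one mean/variance per training example, computed across the features
ln = nn.LayerNorm(normalized_shape=8, elementwise_affine=False)
x_ln = ln(x)
print(torch.allclose(
    x_ln,
    (x - x.mean(dim=1, keepdim=True))
    / torch.sqrt(x.var(dim=1, unbiased=False, keepdim=True) + ln.eps),
    atol=1e-6))  # True
```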
Thank you.
Thanks for the note! Phew, this is a tricky one.
Originally:
While layer normalization is traditionally performed across all elements in a given feature for each feature independently, the layer normalization used in transformers extends this concept and computes the normalization statistics across all feature values independently for each training example.
Yeah, reading this again, it does sound a bit weird. Here is an attempt to clarify it:
While layer normalization is traditionally performed across all feature values in a given layer for each training example independently, the layer normalization used in transformers extends this concept and computes the normalization statistics across all feature values for a given sentence token position independently for each training example.
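To illustrate what I mean, here is a rough sketch assuming the (batch, seq_len, embed_dim) layout used by the PyTorch transformer modules (this is not verbatim book code):

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
batch_size, seq_len, embed_dim = 2, 5, 16
x = torch.randn(batch_size, seq_len, embed_dim)

ln = nn.LayerNorm(embed_dim, elementwise_affine=False)
out = ln(x)

# The statistics are computed over the embedding dimension only, i.e.,
# independently for each training example AND each token position:
mean = x.mean(dim=-1, keepdim=True)                # shape (2, 5, 1)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # shape (2, 5, 1)
print(torch.allclose(out, (x - mean) / torch.sqrt(var + ln.eps), atol=1e-6))  # True
```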
Maybe I should also swap the figure with a better one. This is the figure from the original layer norm paper:

I just found this helpful one here for transformer contexts:
Do you think showing it like this would make it clearer? We can potentially update the book, and I could swap out the figure if you think that would be a helpful change.
I don't have much practical experience in DL, but here are my thoughts:
- The first picture has dimension annotations that are more specific to CNNs than to sequences. This can also be confusing when reading the PyTorch documentation, where, for example, a convolutional layer has (N, C, L) dimensions (for 1d) or (N, C, H, W) dimensions (for 2d), while an embedding layer has the dimensions swapped, (N, L, C), and other layers can use (L, N, C): https://discuss.pytorch.org/t/inconsistent-dimension-ordering-for-1d-networks-ncl-vs-nlc-vs-lnc/14807
- The transformer implementations I have seen, including "The Annotated Transformer", use LayerNorm with normalization only over the last dimension (the embedding dimension), so it seems that classical layer normalization is used (see the sketch after this list).
- I suppose the second picture is from this article; if you mean the left case in that picture, then it probably illustrates your phrasing better. As I understand it, this type of layer normalization is not described in the article in detail, only given for reference.
- Unfortunately, I couldn't find a description of this extended type of normalization either in papers (for example, here) or in implementations, so it would be great if you could suggest any additional references on it.
- In any case, it seems strange to me that the article with this new picture uses the same name (layer normalization) for an algorithm that differs from classical layer normalization. For example, if "we just remove the sum over N in the previous equation compared to BN" (quoted from here), then it gets a separate name, instance normalization, but I couldn't find a dedicated name for our case.
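To illustrate the first two points above, here is a small sketch with standard PyTorch modules (toy sizes picked just for illustration, not taken from any particular implementation):

```python
import torch
import torch.nn as nn

batch_size, seq_len, embed_dim = 2, 5, 16
tokens = torch.randint(0, 100, (batch_size, seq_len))

emb = nn.Embedding(num_embeddings=100, embedding_dim=embed_dim)
x = emb(tokens)                      # (N, L, C) = (2, 5, 16)

# nn.LayerNorm, as used in the transformer implementations I have seen,
# normalizes only over the last (embedding) dimension, regardless of
# what comes before it:
ln = nn.LayerNorm(embed_dim)
print(ln(x).shape)                   # torch.Size([2, 5, 16])

# nn.BatchNorm1d instead expects the channel/feature dimension second,
# i.e., (N, C, L), so the embedding output has to be transposed first:
bn = nn.BatchNorm1d(embed_dim)
print(bn(x.transpose(1, 2)).shape)   # torch.Size([2, 16, 5])
```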