Group normalization documentation is incorrect
Describe the bug
This is purely about the documentation.
In the documentation about group normalization, it is stated:
"Relation to Layer Normalization: If the number of groups is set to 1, then this operation becomes identical to Layer Normalization."
However, that is not true.
Assume an input tensor x of shape [B,T,F] (batch, time, feature dimension). The time axis could equally be the spatial axes H/W, and the feature dimension could be the channels.
In layer normalization, the mean you calculate is:
mean = reduce_mean(x, axis=-1, keepdims=True)  # shape [B,T,1]
You normalize just over the feature axis.
In group normalization with G=1 (so the group axis has size 1 and can be ignored), the mean you calculate is:
mean = reduce_mean(x, axis=[1,2], keepdims=True)  # shape [B,1,1]
You normalize over all axes except the batch axis and the newly added group axis (which makes no difference when G=1).
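The difference between the two reductions can be checked numerically. Below is a minimal sketch in NumPy, where `np.mean` stands in for `reduce_mean`; the sizes and variable names are illustrative assumptions, not taken from the library.

```python
import numpy as np

# Illustrative sizes only: batch B, time T, feature dimension F.
B, T, F = 2, 3, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(B, T, F))

# Layer normalization over the last axis only:
# one mean per (batch, time) position.
ln_mean = np.mean(x, axis=-1, keepdims=True)      # shape [B,T,1]

# Group normalization with G=1:
# one mean per batch entry, taken over time and features together.
gn_mean = np.mean(x, axis=(1, 2), keepdims=True)  # shape [B,1,1]

print(ln_mean.shape)  # (2, 3, 1)
print(gn_mean.shape)  # (2, 1, 1)

# The statistics differ, so the normalized outputs differ as well.
print(np.allclose(ln_mean, gn_mean))  # False for generic inputs
```

The two results only coincide when T=1, since the time axis then contributes nothing to the reduction.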
Or am I misunderstanding something? I ask because the same incorrect statement appears in the original group normalization paper.
The figure from the paper (also here) is also misleading:
In this figure, it looks like layer normalization also normalizes over H/W. But this is not the case, at least not commonly, and not with the default options.
So the figure is wrong about layer normalization, which would normalize just over C, not H/W.
But the figure is correct for group normalization as you have implemented it: it normalizes over all axes except N and G.
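The same point can be illustrated for a channels-last image tensor [N,H,W,C]. The following is a hedged NumPy sketch, not the library's implementation: `group_norm` and `layer_norm` are hypothetical helpers written for this comparison, scale and offset parameters are omitted, and the tensor sizes are arbitrary.

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Sketch of group normalization for input of shape [N,H,W,C] (no scale/offset)."""
    n, h, w, c = x.shape
    assert c % groups == 0, "C must be divisible by the number of groups"
    # Split the channel axis into groups: [N,H,W,G,C//G].
    xg = x.reshape(n, h, w, groups, c // groups)
    # Reduce over every axis except N and G.
    mean = xg.mean(axis=(1, 2, 4), keepdims=True)
    var = xg.var(axis=(1, 2, 4), keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(x.shape)

def layer_norm(x, axes, eps=1e-5):
    """Sketch of layer normalization over the given axes (no scale/offset)."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 5, 8))  # [N,H,W,C], sizes chosen arbitrarily

gn1 = group_norm(x, groups=1)
ln_c = layer_norm(x, axes=(-1,))        # over C only
ln_hwc = layer_norm(x, axes=(1, 2, 3))  # over H, W and C, as the figure suggests

print(np.allclose(gn1, ln_c))    # False
print(np.allclose(gn1, ln_hwc))  # True
```

Under these assumptions, group normalization with G=1 matches layer normalization only when the latter is explicitly configured to reduce over H, W and C, not when it reduces over C alone.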
I also formulated the question here.