fastbook
Mistake in the calculation of multiplications in convolution layers in chapter 13
The book claims that for this neural network,
================================================================
Layer (type) Output Shape Param # Trainable
================================================================
Conv2d 64 x 4 x 14 x 14 40 True
________________________________________________________________
ReLU 64 x 4 x 14 x 14 0 False
________________________________________________________________
Conv2d 64 x 8 x 7 x 7 296 True
________________________________________________________________
ReLU 64 x 8 x 7 x 7 0 False
________________________________________________________________
Conv2d 64 x 16 x 4 x 4 1,168 True
________________________________________________________________
ReLU 64 x 16 x 4 x 4 0 False
________________________________________________________________
Conv2d 64 x 32 x 2 x 2 4,640 True
________________________________________________________________
ReLU 64 x 32 x 2 x 2 0 False
________________________________________________________________
Conv2d 64 x 2 x 1 x 1 578 True
________________________________________________________________
Flatten 64 x 2 0 False
the number of multiplications performed by the second and the third convolution layers is 56_448.
The output shape is 64x4x14x14, and this will therefore become the input shape to the next layer. The next layer, according to the summary, has 296 parameters. Let's ignore the batch axis to keep things simple. So for each of 14*14 = 196 locations we are multiplying 296-8 = 288 weights (ignoring the bias for simplicity), so that's 196*288 = 56_448 multiplications at this layer. The next layer will have 7*7*(1168-16) = 56_448 multiplications.
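(For reference, here is a minimal sketch of a model that would produce a summary like the one above, assuming 3x3 kernels with stride 2 and padding 1 applied to a batch of 64 single-channel 28x28 images, as with the book's simple_cnn; the exact construction in the book may differ.)

```python
import torch
import torch.nn as nn

# Assumed reconstruction of the network behind the summary above:
# 3x3 kernels, stride 2, padding 1, ReLU between convolutions.
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, stride=2, padding=1),   # 40 params,   output 4 x 14 x 14
    nn.ReLU(),
    nn.Conv2d(4, 8, kernel_size=3, stride=2, padding=1),   # 296 params,  output 8 x 7 x 7
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 1168 params, output 16 x 4 x 4
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), # 4640 params, output 32 x 2 x 2
    nn.ReLU(),
    nn.Conv2d(32, 2, kernel_size=3, stride=2, padding=1),  # 578 params,  output 2 x 1 x 1
    nn.Flatten(),
)

x = torch.randn(64, 1, 28, 28)   # batch of 64 single-channel 28x28 images
for layer in model:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape))
```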
I might be missing something, but I don't think this is true.
[Ignoring the batch axis and the biases,] the input shape is 4x14x14 and the output shape is 8x7x7. Each of the 49 (7x7) 8-channel output activations is calculated by multiplying the 288 weights (8x4x3x3) with the corresponding 4x3x3 part of the input, so there are only 49 * 288 = 14_112 multiplications.
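One way to check this count (a sketch assuming a 3x3 kernel, stride 2 and padding 1, as implied by the 14x14 -> 7x7 shapes) is to unfold the input into the patches the convolution actually visits and count them:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 14, 14)                 # one 4-channel 14x14 input (batch axis otherwise ignored)
patches = F.unfold(x, kernel_size=3, stride=2, padding=1)
print(patches.shape)                          # torch.Size([1, 36, 49]): 49 patches of 4*3*3 = 36 values

locations = patches.shape[-1]                 # 49 output locations, not 196
weights_per_location = 8 * 4 * 3 * 3          # 288 weights (out_channels * in_channels * k * k)
print(locations * weights_per_location)       # 14112
```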
It's not true that every input value is multiplied by every weight: because of the stride, some of the 4-channel input pixels are never multiplied by the weights corresponding to the central position of the kernel; some are never multiplied by the weights for the centre-left and centre-right positions; and so on.
The antepenultimate and the penultimate convolutional layers do indeed perform the same number of multiplications: the one with the 16x4x4 output has 16x8x3x3 weights, which corresponds to 4*4 * 16*8*3*3 = 18_432 multiplications; the next one accordingly performs 2*2 * 32*16*3*3 = 18_432 multiplications as well.
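More generally, the stride-aware count for each layer is (number of output locations) * (out_channels * in_channels * k * k). A small sketch applying this to every convolution in the summary above (assuming 3x3 kernels throughout):

```python
# multiplications = (out_h * out_w) * (out_channels * in_channels * k * k), biases ignored
def conv_mults(out_h, out_w, out_ch, in_ch, k=3):
    return out_h * out_w * out_ch * in_ch * k * k

print(conv_mults(14, 14, 4, 1))    # 7056   (first conv)
print(conv_mults(7, 7, 8, 4))      # 14112  (not 56448)
print(conv_mults(4, 4, 16, 8))     # 18432
print(conv_mults(2, 2, 32, 16))    # 18432
print(conv_mults(1, 1, 2, 32))     # 576    (last conv)
```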
But there are 4 channels instead of one, so 4 * 49 * 288 = 56_448 is indeed true; stride 2 just reduces the resolution, but the number of multiplications is still 49 * 288 (for 1 channel).