
Mistake in the calculation of multiplications in convolution layers in chapter 13

Mihonarium opened this issue 2 years ago · 1 comment

The book claims that for this neural network,
================================================================
Layer (type)         Output Shape         Param #    Trainable 
================================================================
Conv2d               64 x 4 x 14 x 14     40         True      
________________________________________________________________
ReLU                 64 x 4 x 14 x 14     0          False     
________________________________________________________________
Conv2d               64 x 8 x 7 x 7       296        True      
________________________________________________________________
ReLU                 64 x 8 x 7 x 7       0          False     
________________________________________________________________
Conv2d               64 x 16 x 4 x 4      1,168      True      
________________________________________________________________
ReLU                 64 x 16 x 4 x 4      0          False     
________________________________________________________________
Conv2d               64 x 32 x 2 x 2      4,640      True      
________________________________________________________________
ReLU                 64 x 32 x 2 x 2      0          False     
________________________________________________________________
Conv2d               64 x 2 x 1 x 1       578        True      
________________________________________________________________
Flatten              64 x 2               0          False 

the number of multiplications performed by each of the second and third convolution layers is 56_448.

The output shape is 64x4x14x14, and this will therefore become the input shape to the next layer. The next layer, according to the summary, has 296 parameters. Let's ignore the batch axis to keep things simple. So for each of 14*14=196 locations we are multiplying 296-8=288 weights (ignoring the bias for simplicity), so that's 196*288=56_448 multiplications at this layer. The next layer will have 7*7*(1168-16)=56_448 multiplications.
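For reference, the book's arithmetic as quoted can be reproduced in a few lines (a sketch; the 296-parameter count is taken from the summary above):

```python
# The book's count for the second conv layer, as quoted:
# it uses the 14x14 *input* spatial size as the number of locations.
params, out_ch = 296, 8            # from the summary row for this layer
weights = params - out_ch          # 288 weights, biases excluded
locations = 14 * 14                # 196 input locations
print(locations * weights)         # 56448, the book's figure
```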

I might be missing something, but I don't think this is true.

[Ignoring the batch axis and the biases,] the input shape is 4x14x14, and the output shape is 8x7x7. All 49 (7x7) 8-channel output activations are calculated (ignoring the biases) by multiplying the 288 weights (8x4x3x3) with the corresponding 4x3x3 part of the input. So there are only 14_112 (49 * 288) multiplications.
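This corrected count is easy to check numerically; a minimal sketch using the layer's shape from the summary (4 -> 8 channels, 3x3 kernel, stride 2, so the 14x14 input becomes a 7x7 output):

```python
in_ch, out_ch, k = 4, 8, 3         # second conv layer
out_h = out_w = 7                  # stride 2 halves 14x14 to 7x7
weights = out_ch * in_ch * k * k   # 288 = the 296 params minus 8 biases
mults = out_h * out_w * weights    # one multiply per weight per output location
print(mults)                       # 14112
```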

It's not true that every input value is multiplied with every weight. Because of the stride, some of the 4-channel input pixels never get multiplied with the weights corresponding to the central pixel of the kernel; some never get multiplied with the central-right and central-left pixels of the kernel; and so on.

The antepenultimate and the penultimate convolutional layers do indeed perform the same number of multiplications: the one with the 16x4x4 output has 16x8x3x3 weights, which corresponds to 4*4 * 16*8*3*3 = 18_432 multiplications; the next one accordingly does 2*2 * 32*16*3*3 multiplications, which is also 18_432.
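Applying the same formula (out_h * out_w * out_ch * in_ch * 3 * 3, biases and the batch axis ignored) to every conv layer in the summary confirms this. Note the first layer's 1-channel input is an assumption here, consistent with its 40 parameters (1*4*3*3 weights plus 4 biases):

```python
# (in_ch, out_ch, out_hw) for each Conv2d row in the summary; 3x3 kernels
layers = [(1, 4, 14), (4, 8, 7), (8, 16, 4), (16, 32, 2), (32, 2, 1)]
for in_ch, out_ch, out_hw in layers:
    mults = out_hw * out_hw * out_ch * in_ch * 3 * 3
    print(f"{in_ch:>2} -> {out_ch:<2} channels: {mults} multiplications")
# 7056, 14112, 18432, 18432, 576: the middle two layers match as claimed
```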

Mihonarium commented on Sep 13 '22

But there are 4 channels instead of one, so 4 * 49 * 288 = 56448 is indeed true; stride-2 just reduces the resolution, but the number of multiplications is still 49 * 288 (for one channel).

vonrafael commented on Apr 19 '24