human-pose-estimation.pytorch
Intuition behind the choice of ConvTranspose2D (deconvolution)?
Thanks for the great repo!
I noticed that the spatial resolution is reduced by a large factor (32 in one network) via max pooling and strided convolutions (stride > 1), which brings it below the resolution of the output heatmaps; it is then upsampled back with transposed convolutions. What is the reasoning behind this design? Was it arrived at purely empirically? In effect, the network is forced to encode spatial information in the feature dimension, so why not simply downsample less (fewer pooling layers, smaller strides) instead of adding transposed convolutions at the end?
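For concreteness, here is a minimal shape-check sketch of the pattern I'm asking about (my own illustration, not the repo's exact code): it assumes a 256×192 input, a toy backbone that downsamples by 32×, and three stride-2 `ConvTranspose2d` layers that recover heatmaps at 1/4 of the input resolution.

```python
import torch
import torch.nn as nn

# Toy backbone: downsamples 256x192 -> 8x6 (a factor of 32) via one
# strided conv, one max pool, and three more stride-2 convs.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # /2
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),         # /4
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),   # /8
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),  # /16
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),  # /32
)

# Deconv head: each ConvTranspose2d(kernel=4, stride=2, padding=1)
# doubles H and W, so three of them upsample 8x: 8x6 -> 64x48.
# The joint count (17, as in COCO keypoints) is an assumption here.
deconv_head = nn.Sequential(
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),  # /16
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),  # /8
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),  # /4
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 17, kernel_size=1),  # one heatmap per joint
)

x = torch.randn(1, 3, 256, 192)
features = backbone(x)            # torch.Size([1, 512, 8, 6])
heatmaps = deconv_head(features)  # torch.Size([1, 17, 64, 48])
print(features.shape, heatmaps.shape)
```

So the 8×6 bottleneck is coarser than the 64×48 heatmaps the head must produce, which is exactly what prompts my question: why go through this bottleneck-then-upsample route rather than stopping the backbone at, say, /4?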