human-pose-estimation.pytorch

Intuition behind the choice of ConvTranspose2d (deconvolution)?

Open anuar12 opened this issue 5 years ago • 0 comments

Thanks for the great repo!

I noticed that the spatial resolution is reduced by a large factor (32 in one network) with max pooling and strided convolutions (stride > 1), which takes the feature map below the resolution of the output map. It is then "upsampled" with transposed convolutions. What is the reasoning behind this? Was this design arrived at purely empirically? In effect, it forces the network to store spatial information in the feature (channel) dimension. Why not just downsample less, with less max pooling and smaller strides, instead of adding transposed convolutions at the end?
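To make the question concrete, here is a rough sketch of the size arithmetic I have in mind. It uses only the standard output-size formulas for convolution and transposed convolution. The 256x192 input, the kernel/stride/padding values, and the count of three transposed convolutions are my assumptions about a ResNet-like configuration and are not copied from the repo's code; `conv_out`, `deconv_out`, and `trace` are just helper names made up for this example.

```python
def conv_out(n, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


def deconv_out(n, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


def trace(n):
    """Follow one spatial axis through an ASSUMED ResNet-like stack."""
    sizes = [n]
    n = conv_out(n, kernel=7, stride=2, padding=3)   # stem conv: 1/2
    sizes.append(n)
    n = conv_out(n, kernel=3, stride=2, padding=1)   # max pool: 1/4
    sizes.append(n)
    for _ in range(3):                               # three stride-2 stages: 1/32
        n = conv_out(n, kernel=3, stride=2, padding=1)
        sizes.append(n)
    for _ in range(3):                               # three transposed convs: back to 1/4
        n = deconv_out(n, kernel=4, stride=2, padding=1)
        sizes.append(n)
    return sizes


heights = trace(256)  # assumed input height
widths = trace(192)   # assumed input width
print(heights)  # [256, 128, 64, 32, 16, 8, 16, 32, 64]
print(widths)   # [192, 96, 48, 24, 12, 6, 12, 24, 48]
```

Under these assumptions the feature map bottoms out at 8x6 and is then brought back up to 64x48, so the output map has the same resolution the network already had right after the max pool. That is the part I would like to understand.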

anuar12 Jan 09 '20 16:01