pytorch-video-recognition
Downsample step
I was checking your PyTorch implementation of the R2Plus1D model against the Caffe2 implementation in the original paper's repository (https://github.com/facebookresearch/VMZ), and I was wondering why you chose to implement the downsample step as a SpatioTemporalConv layer, while the original implementation seems to use only a single Conv3D layer. They have coded it as follows:
```python
if (num_filters != input_filters) or down_sampling:
    shortcut_blob = self.model.ConvNd(
        shortcut_blob,
        'shortcut_projection_%d' % self.comp_count,
        input_filters,
        num_filters,
        [1, 1, 1],
        weight_init=("MSRAFill", {}),
        strides=use_striding,
        no_bias=self.no_bias,
    )
    if spatial_batch_norm:
        shortcut_blob = self.model.SpatialBN(
            shortcut_blob,
            'shortcut_projection_%d_spatbn' % self.comp_count,
            num_filters,
            epsilon=1e-3,
            is_test=self.is_test,
        )
```
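In PyTorch terms, I read this shortcut as roughly the following (my own sketch for illustration, not code from either repository):

```python
import torch.nn as nn

def make_shortcut(in_channels, out_channels, stride=(2, 2, 2)):
    # A single 1x1x1 3D convolution that projects channels (and applies
    # striding when downsampling), followed by batch norm -- no (2+1)D
    # decomposition in the shortcut path.
    return nn.Sequential(
        nn.Conv3d(in_channels, out_channels, kernel_size=1,
                  stride=stride, bias=False),
        nn.BatchNorm3d(out_channels, eps=1e-3),
    )
```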
Was this design choice deliberate, and if so, could you perhaps tell me why?
Thanks!
Hi, sorry for the late reply.
You could look here. When the model is r2plus1d, `is_decomposed` is set to True. When `is_decomposed` is True, the model uses SpatioTemporalConv instead of a plain Conv3D, which you can check here.
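For reference, the (2+1)D factorization replaces each full t x d x d 3D convolution with a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution. A minimal standalone sketch of the idea, using the intermediate-width formula from the R(2+1)D paper (illustrative only, not the exact code in this repo):

```python
import math
import torch.nn as nn

class DecomposedConv3d(nn.Module):
    # Sketch of a (2+1)D convolution: a 1xdxd spatial conv, then a
    # tx1x1 temporal conv, with intermediate width M chosen so the
    # parameter count roughly matches a full txdxd 3D convolution.
    def __init__(self, in_ch, out_ch, t=3, d=3, stride=(1, 1, 1)):
        super().__init__()
        # M = floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out)),
        # as given in the R(2+1)D paper.
        m = int(math.floor((t * d * d * in_ch * out_ch) /
                           (d * d * in_ch + t * out_ch)))
        self.spatial = nn.Conv3d(in_ch, m, (1, d, d),
                                 stride=(1, stride[1], stride[2]),
                                 padding=(0, d // 2, d // 2), bias=False)
        self.bn = nn.BatchNorm3d(m)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(m, out_ch, (t, 1, 1),
                                  stride=(stride[0], 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):
        return self.temporal(self.relu(self.bn(self.spatial(x))))
```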
Hi, thanks for your reply.
I understand that a SpatioTemporalConv is needed for the R(2+1)D network, but I don't think the original authors use it in their downsample step specifically, as can be seen here. Your downsample step, however, does use a SpatioTemporalConv. Could you explain why?
In your R(2+1)D network code:

```python
self.conv3 = SpatioTemporalResLayer(64, 128, 3, layer_sizes[1], block_type=block_type, downsample=True)
```

The docstring says: downsample (bool, optional): If True, the first block in the layer will implement downsampling. Default: False.

Here the output size is 128 and the input size is 64, so why is downsample=True?
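My current reading of the flag (a sketch of my understanding, names illustrative, possibly wrong) is that downsample=True makes the first block stride by 2 in time and space in addition to the 64 -> 128 channel projection:

```python
import torch.nn as nn

# What I assume downsample=True implies for the first block: both a
# channel projection and a stride-2 reduction, so the residual shortcut
# also needs a strided 1x1x1 projection from 64 to 128 channels.
first_conv = nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)
shortcut = nn.Conv3d(64, 128, kernel_size=1, stride=2, bias=False)
```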
Thanks! @jfzhang95