pytorch-video-recognition

Downsample step

Open LeoniekevandenBulk opened this issue 6 years ago • 3 comments

I was checking your PyTorch implementation of the R2Plus1D model against the Caffe2 implementation in the original paper's repository (https://github.com/facebookresearch/VMZ), and I was wondering why you chose to implement the downsample step as a SpatioTemporalConv layer, while the original implementation seems to use only a single Conv3D layer. They have coded it as follows:

    if (num_filters != input_filters) or down_sampling:
        shortcut_blob = self.model.ConvNd(
            shortcut_blob,
            'shortcut_projection_%d' % self.comp_count,
            input_filters,
            num_filters,
            [1, 1, 1],
            weight_init=("MSRAFill", {}),
            strides=use_striding,
            no_bias=self.no_bias,
        )
        if spatial_batch_norm:
            shortcut_blob = self.model.SpatialBN(
                shortcut_blob,
                'shortcut_projection_%d_spatbn' % self.comp_count,
                num_filters,
                epsilon=1e-3,
                is_test=self.is_test,
            )
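For comparison, the Caffe2 snippet above could be written in PyTorch roughly as follows. This is only a sketch: the helper name `make_shortcut`, the choice of `bias=False`, the `fan_out` initialisation mode, and the example tensor sizes are my assumptions, not code from either repository.

```python
import torch
import torch.nn as nn


def make_shortcut(in_channels, out_channels, stride=(1, 1, 1), bias=False):
    """Hypothetical helper: 1x1x1 projection shortcut followed by batch norm."""
    # A single 3D convolution with a 1x1x1 kernel, like the [1, 1, 1] ConvNd call.
    conv = nn.Conv3d(in_channels, out_channels, kernel_size=1,
                     stride=stride, bias=bias)
    # MSRAFill is a He-style initialisation; the fan_out mode is an assumption.
    nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")
    # epsilon=1e-3 mirrors the SpatialBN call in the snippet.
    bn = nn.BatchNorm3d(out_channels, eps=1e-3)
    return nn.Sequential(conv, bn)


shortcut = make_shortcut(64, 128, stride=(2, 2, 2))
x = torch.randn(1, 64, 8, 56, 56)  # (batch, channels, frames, height, width)
y = shortcut(x)
print(y.shape)  # torch.Size([1, 128, 4, 28, 28])
```

The point of the sketch is that the shortcut here is one convolution with no spatial/temporal decomposition.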

Was this design choice on purpose, and if so, could you perhaps tell me why?

Thanks!

LeoniekevandenBulk, Nov 27 '18 19:11

Hi, sorry for the late reply.

You can look here. When the model is r2plus1d, is_decomposed is set to True.

When is_decomposed is True, it uses SpatioTemporalConv instead of a plain 3D convolution, which you can check here.
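To make the difference concrete, here is a minimal sketch of a decomposed (2+1)D convolution next to a plain 3D convolution. The class name `Conv2Plus1D`, the `mid_channels` value, and the tensor sizes are assumptions chosen for illustration; this is not the SpatioTemporalConv code from this repository.

```python
import torch
import torch.nn as nn


class Conv2Plus1D(nn.Module):
    """Hypothetical (2+1)D block: a spatial 1xkxk conv, then a temporal kx1x1 conv."""

    def __init__(self, in_channels, out_channels, mid_channels,
                 kernel_size=3, stride=1):
        super().__init__()
        pad = kernel_size // 2
        # Spatial part: convolves over height and width only.
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, kernel_size, kernel_size),
                                 stride=(1, stride, stride),
                                 padding=(0, pad, pad), bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        # Temporal part: convolves over the frame axis only.
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(kernel_size, 1, 1),
                                  stride=(stride, 1, 1),
                                  padding=(pad, 0, 0), bias=False)

    def forward(self, x):
        return self.temporal(self.relu(self.bn(self.spatial(x))))


x = torch.randn(2, 64, 8, 28, 28)  # (batch, channels, frames, height, width)
decomposed = Conv2Plus1D(64, 128, mid_channels=96, stride=2)
plain = nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)

out = decomposed(x)
print(out.shape)       # torch.Size([2, 128, 4, 14, 14])
print(plain(x).shape)  # torch.Size([2, 128, 4, 14, 14])
```

Both variants produce the same output shape; they differ in how the 3D kernel is factorised and in the extra nonlinearity between the two steps.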

jfzhang95, Dec 01 '18 10:12

Hi, thanks for your reply.

I understand that a SpatioTemporalConv is needed for the R(2+1)D network, but I don't think the original authors use it in their downsample step specifically, as can be found here. Your downsample step, however, does use a SpatioTemporalConv. Could you explain why?

LeoniekevandenBulk, Dec 03 '18 16:12

In your R(2+1)D network code:

    self.conv3 = SpatioTemporalResLayer(64, 128, 3, layer_sizes[1], block_type=block_type, downsample=True)

and the docstring says:

    downsample (bool, optional): If True, the first block in layer will implement downsampling. Default: False

Here output size = 128 and input size = 64. Why is downsample=True? Thanks! @jfzhang95
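As general background, and not a statement about the author's intent: in ResNet-style layers, a residual addition needs both branches to have the same shape, so a block that changes the channel count from 64 to 128 (and often halves the resolution with a stride) cannot add the unmodified input back. The sketch below illustrates this with plain `nn.Conv3d` layers; the layer choices and tensor sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 28, 28)  # (batch, channels, frames, height, width)

# Main branch of a hypothetical first block: 64 -> 128 channels, stride 2.
main = nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)
out = main(x)  # shape (1, 128, 4, 14, 14)

# Adding the raw input fails because neither channels nor resolution match.
try:
    _ = out + x
    identity_ok = True
except RuntimeError:
    identity_ok = False

# A strided 1x1x1 projection on the shortcut brings x to the same shape.
proj = nn.Conv3d(64, 128, kernel_size=1, stride=2, bias=False)
res = out + proj(x)

print(identity_ok)  # False
print(res.shape)    # torch.Size([1, 128, 4, 14, 14])
```

Whether this is the reason for `downsample=True` in `conv3` is for the author to confirm.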

JinXiaozhao, Dec 03 '19 14:12