
Need Network architecture explanation

ajay9022 opened this issue on Mar 17, 2019 · 2 comments

I am new to this field and would like an explanation of the network architecture used in the paper. The paper describes the architecture, but I do not completely understand it.

What do C1, ..., Ck mean in the image below (taken from the paper)? What does the yellow arrow signify, and how do C1, ..., Ck get reduced in each subsequent layer?

[Screenshot from the paper: network architecture figure]

Can you help me understand how a given video is convolved through the network, what the different dimensions of the feature maps are, and some other details of the architecture?

I seek an explanation rather than the code.

ajay9022 · Mar 17 '19, 17:03

Hi, C1...Ck denote either two-channel motion (flow-x, flow-y) or three-channel appearance (R, G, B). I am afraid this question requires understanding the basics of 3D convolutions. You might want to apply the model to a dummy input and inspect the output sizes of each layer to understand what is happening, and perhaps also play with the architecture parameters such as the kernel sizes.

The yellow arrow illustrates the 3D convolution operation: the 3x3x3 filter is moved across all locations in the volume. See also the C3D paper for further details. The 3D pooling operations reduce the size of the volumes. I have given the dimensionalities of each layer in the architecture definition code, models/ltc.lua.
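For example, something along these lines lets you print the size of the volume after every layer. This is only a minimal sketch with placeholder filter counts and pooling sizes, not the actual models/ltc.lua definition:

```lua
-- Minimal sketch: build a small 3D ConvNet, feed it a dummy clip and
-- print the output size of every layer. Filter counts, kernel and
-- pooling sizes here are placeholders, not the values in models/ltc.lua.
require 'nn'

local net = nn.Sequential()
-- conv args: (nInputPlane, nOutputPlane, kT, kW, kH, dT, dW, dH, padT, padW, padH)
net:add(nn.VolumetricConvolution(3, 64, 3, 3, 3, 1, 1, 1, 1, 1, 1))
net:add(nn.ReLU(true))
-- pool args: (kT, kW, kH, dT, dW, dH); this one pools only spatially
net:add(nn.VolumetricMaxPooling(1, 2, 2, 1, 2, 2))
net:add(nn.VolumetricConvolution(64, 128, 3, 3, 3, 1, 1, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(nn.VolumetricMaxPooling(2, 2, 2, 2, 2, 2))

-- dummy input: batch x channels x frames x height x width
-- (3 channels for RGB appearance; it would be 2 for flow-x/flow-y)
local input = torch.rand(1, 3, 16, 58, 58)
net:forward(input)

for i, m in ipairs(net.modules) do
   print(i, torch.type(m), m.output:size())
end
```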

gulvarol · Mar 17 '19, 21:03

The C3D paper says: "The networks have 5 convolution layers and 5 pooling layers (each convolution layer is immediately followed by a pooling layer), 2 fully-connected layers and a softmax loss layer to predict action labels. The number of filters for 5 convolution layers from 1 to 5 are 64, 128, 256, 256, 256, respectively."

Is it true that the current network architecture is the same as the one above, with only the input changed to a temporal resolution of t ∈ {20, 40, 60, 80, 100} frames and a spatial resolution of {58×58, 71×71} pixels?
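To check my understanding, this is how I would write that first architecture down. The strides, paddings and pooling sizes are my own guesses and not taken from models/ltc.lua; only the filter counts come from the quote above:

```lua
-- My sketch of the 5 conv + 5 pool architecture with 64/128/256/256/256
-- filters, checking how the conv-tower output changes with the temporal
-- resolution t. Paddings/pooling sizes are guesses for illustration.
require 'nn'

local nFilters = {64, 128, 256, 256, 256}
local net = nn.Sequential()
local nIn = 3  -- 3 channels for RGB; it would be 2 for flow
for i, nOut in ipairs(nFilters) do
   -- 3x3x3 convolution with padding 1 (keeps the volume size)
   net:add(nn.VolumetricConvolution(nIn, nOut, 3, 3, 3, 1, 1, 1, 1, 1, 1))
   net:add(nn.ReLU(true))
   if i == 1 then
      -- first pooling spatial only, as in C3D (another guess on my part)
      net:add(nn.VolumetricMaxPooling(1, 2, 2, 1, 2, 2))
   else
      net:add(nn.VolumetricMaxPooling(2, 2, 2, 2, 2, 2))
   end
   nIn = nOut
end

-- output volume of the conv tower for each temporal resolution t
-- (the two fully connected layers and the softmax would come after this)
for _, t in ipairs({20, 40, 60, 80, 100}) do
   local out = net:forward(torch.rand(1, 3, t, 58, 58))
   print('t = ' .. t, out:size())
end
```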

But later, after experimentation, the network was changed to: "With current GPU memory, we design our 3D ConvNet to have 8 convolution layers, 5 pooling layers, followed by two fully connected layers, and a softmax output layer."

Why didn't the current paper (published by you) use the second architecture?

ajay9022 · Mar 19 '19, 12:03