video-classification
Tensor sizes input in ConvNet and RNN
Thanks for the code. I have a couple of questions regarding tensor sizes.
- The dataloader creates tensors of size X = (#videos, #frames, 3, H, W) and y = (#videos, 1). There's a loop over #videos in the train method, but in my implementation it only returned index = 0, so the input to the ConvNet has size (#videos, #frames, 3, H, W). Is this correct?
- In the ConvNet's forward method there's a loop over the #frames in the video. It flattens the pool layer output into a vector to get a tensor of size (#videos, #frames, CNN_embed_dim), which is both the output of the ConvNet and the input to the RNN. Is this right?
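
To make the second question concrete, here is a minimal sketch of the shapes as I understand them. It is not the repository's code: the tiny conv/pool/linear stack and all sizes are hypothetical stand-ins, and only the name CNN_embed_dim comes from the original.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
batch, frames, H, W = 2, 5, 32, 32
CNN_embed_dim = 16

# Stand-in for the real ConvNet: conv -> global pool -> linear.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
pool = nn.AdaptiveAvgPool2d(1)
fc = nn.Linear(8, CNN_embed_dim)

X = torch.randn(batch, frames, 3, H, W)

embeddings = []
for t in range(X.size(1)):       # loop over frames, not over videos
    x = pool(conv(X[:, t]))      # frame t of every video: (batch, 8, 1, 1)
    x = x.view(x.size(0), -1)    # flatten pooled features: (batch, 8)
    embeddings.append(fc(x))     # (batch, CNN_embed_dim)

# Stack along a new time axis: (batch, frames, CNN_embed_dim)
cnn_out = torch.stack(embeddings, dim=1)
print(cnn_out.shape)
```

In this sketch every frame step handles all videos at once, so the batch dimension never needs its own loop inside the ConvNet.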
I don't quite understand how the RNN processes the batch dimension, i.e. the number of videos. Is there some internal loop for this that I can't find in the code?
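
My current guess is that PyTorch's nn.LSTM handles the batch dimension internally, so no explicit loop over videos would appear in the model code. The sketch below checks that guess under the assumption that the RNN is an nn.LSTM built with batch_first=True; the sizes and the hidden width are hypothetical, not taken from the repository.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
batch, frames, CNN_embed_dim, hidden = 2, 5, 16, 8

# Assumption: the RNN is an LSTM that expects (batch, frames, features).
rnn = nn.LSTM(input_size=CNN_embed_dim, hidden_size=hidden, batch_first=True)

cnn_out = torch.randn(batch, frames, CNN_embed_dim)

with torch.no_grad():
    # One call processes every video in the batch.
    out, (h_n, c_n) = rnn(cnn_out)       # out: (batch, frames, hidden)

    # Feeding only the first video gives the same result for that video,
    # so videos in a batch do not interact with each other.
    out_single, _ = rnn(cnn_out[0:1])    # (1, frames, hidden)

last_step = out[:, -1, :]                # (batch, hidden), one vector per video
same = torch.allclose(out[0:1], out_single, atol=1e-5)
print(out.shape, last_step.shape, same)
```

If that assumption holds, the "loop" over videos is just the vectorized batch dimension inside the LSTM call.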