What does the paper mean by different temporal extents being used on the 60f network?

Open ajay9022 opened this issue 6 years ago • 7 comments

I was reading the paper Long-term Temporal Convolutions for Action Recognition and saw that the authors tried different temporal extents t ∈ {20,40,60,80,100} on the 60f network.

I didn't understand the term temporal extent as used here. Can you also explain what 60f network means?

From this link I learned that a video is made up of many clips and that each clip consists of some x frames. Does that hold true in this paper too?

ajay9022 avatar Feb 15 '19 17:02 ajay9022

The temporal extent is simply the number of frames in the network's input clip. t ∈ {20,40,60,80,100} is not applied to the 60f network. Each value is a separate network: 20 frames (20f), 40 frames (40f), 60 frames (60f), etc. We ran experiments with different input resolutions. Yes, the terminology must be the same for clip and video.

gulvarol avatar Feb 15 '19 17:02 gulvarol

So that means the input to the 60f network is 60 frames, right?

Also, one of the takeaways of the paper is that inputs with higher temporal resolution get better accuracy. So inputs with more frames are recognised better than those with fewer frames.

Does that mean the gap between 2 consecutive frames is smaller for temporal extent = 100 than for temporal extent = 60? Because with t = 60 the same video would be covered by fewer frames, which would be further apart.

ajay9022 avatar Feb 15 '19 18:02 ajay9022

Yes, 60f means 60 frames.

The difference between 2 consecutive frames for 60f and 100f is the same since we always sample consecutive frames from the original video. More randomness in such sampling could improve the results. This is not something we investigated in that paper.
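To make the point about frame spacing concrete, here is a minimal sketch (not the authors' code) contrasting consecutive-frame clips, as described in the reply above, with the evenly spread subsampling the question assumed. The function names and the 240-frame video length are hypothetical and chosen only for illustration.

```python
def consecutive_clip(start, t):
    """Frame indices of a t-frame clip made of consecutive frames."""
    return list(range(start, start + t))


def uniformly_subsampled_clip(num_frames, t):
    """Frame indices if t frames were spread evenly over the whole video.

    This is the scheme the question assumed, NOT the one the reply describes.
    """
    return [(i * num_frames) // t for i in range(t)]


def frame_gaps(indices):
    """Set of distinct gaps between neighbouring frame indices."""
    return {b - a for a, b in zip(indices, indices[1:])}


# With consecutive sampling the gap is 1 for both the 60f and 100f inputs;
# only the length of the clip changes.
gaps_60f = frame_gaps(consecutive_clip(0, 60))
gaps_100f = frame_gaps(consecutive_clip(0, 100))

# With even subsampling of a 240-frame video, the gap would depend on t.
gaps_subsampled_60 = frame_gaps(uniformly_subsampled_clip(240, 60))
```

Under consecutive sampling, a longer temporal extent covers a longer stretch of the video at the same frame rate, rather than the same stretch at a finer spacing.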

gulvarol avatar Feb 17 '19 15:02 gulvarol

That means a 240-frame video fed into the 60f network will only contribute its first 60 frames, and the last 180 frames will be ignored. In that case the network sees only part of the information in the video, which would hurt recognition accuracy.

Just to confirm, did I get it right?

ajay9022 avatar Mar 18 '19 14:03 ajay9022

This is explained in the last two paragraphs of Section 3.3 of the paper. At training, we take a random (not necessarily the first) 60-frame clip. At test time, we perform sliding windows and average their scores. Otherwise, using only 1 clip of course reduces the accuracy.
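The procedure described here can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, clips are represented as half-open (start, end) frame ranges, the stride of 4 is the test-time value quoted from the paper later in this thread, and the video is assumed to have at least t frames.

```python
import random


def sample_training_clip(num_frames, t, rng=random):
    """Pick a random t-frame clip of consecutive frames (training)."""
    start = rng.randrange(num_frames - t + 1)
    return start, start + t  # half-open range [start, end)


def sliding_windows(num_frames, t, stride=4):
    """All t-frame clips obtained by sliding over the video (testing)."""
    return [(s, s + t) for s in range(0, num_frames - t + 1, stride)]


def video_score(clip_scores):
    """Average per-class scores over all clips of one video."""
    n = len(clip_scores)
    num_classes = len(clip_scores[0])
    return [sum(s[c] for s in clip_scores) / n for c in range(num_classes)]


# A 240-frame video and the 60f network: one random clip at training time,
# many overlapping clips at test time.
train_clip = sample_training_clip(240, 60, random.Random(0))
test_clips = sliding_windows(240, 60)
```

For a 240-frame video this yields 46 overlapping test clips, starting at frames 0, 4, 8, ..., 180, so every frame of the video contributes to the averaged score.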

gulvarol avatar Mar 18 '19 15:03 gulvarol

Does sliding windows mean sliding through frames 1-60, then 5-64, 9-68 and so on, given the stride of 4 frames stated in the paper, or does it mean something else?

ajay9022 avatar Mar 18 '19 16:03 ajay9022

Can you explain the last paragraph of Section 3.3 a bit more? I didn't understand how the cropping is done.

At test time, a video is divided into t-frame clips with a temporal stride of 4 frames. Each clip is further tested with 10 crops, namely the 4 corners and the center, together with their horizontal flips. The video score is obtained by averaging over clip scores and crop scores. If the number of frames in a video is less than the clip size, we pad the input by repeating the last frames to fill the missing volume.
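One way to read the cropping and padding in the quoted paragraph is sketched below. This is an interpretation, not the authors' code: the function names, the 128x171 frame size and the 112x112 crop size are hypothetical, and the padding assumes that "repeating the last frames" means repeating the final frame until the clip has t frames.

```python
def ten_crops(frame_h, frame_w, crop_h, crop_w):
    """Return 10 (top, left, flipped) crops: 4 corners + center, each also flipped."""
    top_max = frame_h - crop_h
    left_max = frame_w - crop_w
    offsets = [
        (0, 0),                         # top-left corner
        (0, left_max),                  # top-right corner
        (top_max, 0),                   # bottom-left corner
        (top_max, left_max),            # bottom-right corner
        (top_max // 2, left_max // 2),  # center
    ]
    return [(top, left, flip) for flip in (False, True) for top, left in offsets]


def pad_clip_indices(num_frames, t):
    """Frame indices for one t-frame clip of a video shorter than t frames.

    Assumption: the final frame is repeated to fill the missing frames.
    """
    return [min(i, num_frames - 1) for i in range(t)]


crops = ten_crops(128, 171, 112, 112)  # hypothetical frame and crop sizes
padded = pad_clip_indices(58, 60)      # a 58-frame video fed to the 60f network
```

Each test clip is scored once per crop, so a video with N clips produces 10 × N score vectors, which are then averaged into a single video score.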

Also, do different clips in a given video show different actions? Why do we focus on clips of a video rather than on the complete video at once?

Does the clip size mean the number of frames in a given clip? How exactly is a video divided into clips? Do the clips in a video share frames?

ajay9022 avatar Mar 18 '19 16:03 ajay9022