Coarse-Fine-Networks: Where to set the number of frames

Hello, I recently read your paper, and it states that for "Charades dataset, we configure our network to use T = 64, T= 128 and α = 1/4." Could you point me to where in the code the number of frames is set to 128?

I've only seen the frames parameter being set to 80 in:

train_fine.py, line 57 (80x4)
extract_fineFEAT.py, line 61
train_coarse_fineFEAT.py, line 60 (80x4)

Would you mind clarifying why the value is 80? I'm trying to train on my own dataset and would like to make sure I adjust the number of frames at the correct points in the code. I would greatly appreciate it if you could point me in the right direction :)

Mar 17 '23 12:03 margauxbowditch

Hi,

Thank you for your interest in our work, and sorry about the confusion. It is true that we use T=64 for the Coarse-stream input and T=128 for the Fine-stream input. The value you mention, 320 (80x4), is the number of frames we consider at the original frame rate. However, we sample at a lower frame rate (gamma_tau=5), which results in 64 frames by default (320=64x5). Similarly, when we pre-extract Fine features and feed them to the two-stream model, we extract features for the corresponding 128 frames.
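A minimal sketch of the arithmetic above, assuming frames are picked with a fixed stride of gamma_tau from a window measured at the original frame rate. The helper name `strided_indices` and its arguments are hypothetical and not taken from the repository.

```python
def strided_indices(start: int, window: int, gamma_tau: int) -> list:
    """Hypothetical helper: original-frame indices taken every `gamma_tau`
    frames from a window of `window` frames beginning at `start`."""
    if window % gamma_tau != 0:
        raise ValueError("window should be a multiple of gamma_tau")
    return list(range(start, start + window, gamma_tau))


# 320 original frames (80x4) sampled with gamma_tau=5 -> 64 frames
coarse = strided_indices(start=0, window=80 * 4, gamma_tau=5)
print(len(coarse))  # 64
print(coarse[:4])   # [0, 5, 10, 15]
```

The 80x4 value in the scripts is therefore a window length at the original frame rate, not the number of frames the network actually sees.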

P.S. gamma_tau becomes 10 (i.e., 5 -> 10) within the dataset file, but we still sample 64/128 frames, so the corresponding temporal receptive field increases to 640/1280 frames at the original frame rate.
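As a rough check of those numbers, the window length at the original frame rate is just the number of sampled frames times the stride. The function name `receptive_field` is a hypothetical illustration; the 64/128 frame counts and gamma_tau=10 come from the reply above.

```python
def receptive_field(num_frames: int, gamma_tau: int) -> int:
    """Hypothetical helper: length, in original-rate frames, of the window
    covered when `num_frames` frames are sampled every `gamma_tau` frames."""
    return num_frames * gamma_tau


GAMMA_TAU = 10  # value used inside the dataset file, per the reply
for stream, t in (("Coarse", 64), ("Fine", 128)):
    print(stream, t, "frames ->", receptive_field(t, GAMMA_TAU), "original frames")
# Coarse 64 frames -> 640 original frames
# Fine 128 frames -> 1280 original frames
```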

However, we tuned these hyperparameters so that they work best for Charades. For your own dataset, you can experiment with them. Generally, our X3D backbone can work with frame rates as low as 2.5 FPS (25 original FPS / 10). Ideally, the frames in the Fine stream should cover the whole video. On Charades, 128 frames at this lower frame rate are sufficient to cover the full temporal duration of >90% of videos.
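When adapting these settings to another dataset, one way to sanity-check a choice of frame count and gamma_tau is to compute the effective frame rate and the fraction of videos the Fine-stream window fully covers. This is a sketch under assumptions: the function names are hypothetical, and the durations below are placeholders, not statistics from Charades or any real dataset.

```python
def effective_fps(original_fps: float, gamma_tau: int) -> float:
    """Frame rate seen by the network after strided sampling."""
    return original_fps / gamma_tau


def fine_stream_coverage(durations_sec, num_frames: int,
                         original_fps: float, gamma_tau: int) -> float:
    """Hypothetical helper: fraction of videos whose full duration fits
    inside the Fine-stream window of num_frames * gamma_tau original frames."""
    if not durations_sec:
        raise ValueError("need at least one duration")
    window_sec = num_frames * gamma_tau / original_fps
    covered = sum(1 for d in durations_sec if d <= window_sec)
    return covered / len(durations_sec)


print(effective_fps(25, 10))  # 2.5
# Placeholder durations in seconds (NOT real dataset statistics)
durations = [12.0, 30.5, 48.0, 55.0]
print(fine_stream_coverage(durations, num_frames=128, original_fps=25, gamma_tau=10))  # 0.75
```

With 128 frames, gamma_tau=10 and 25 FPS, the window spans 51.2 seconds, so only the 55-second placeholder video falls outside it.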

Does this make things clear?

Thanks!

Apr 03 '23 06:04 kkahatapitiya

Hello,

Thank you very much for your response to my question.

I do have one follow-up question. It concerns lines 81 and 86 of the extract_fineFEAT.py script: 'testing' is set for both dataloaders. This causes an issue when I run the next script, train_coarse_fineFEAT.py, as it cannot find features for the training data. How would you suggest fixing this? Can I set line 81 to 'training' instead of 'testing' so that features are extracted for all the data?

Your help is much appreciated!

May 23 '23 12:05 margauxbowditch

Sorry for the delayed response. The purpose of using the testing flag for both the train and val splits when extracting features is to avoid the random sampling and augmentations that apply with the training flag. The extracted features correspond to the actual inputs as they are, not to augmented versions. Note that extract_fineFEAT.py does extract features for both the train and val splits; can you verify that this is the case for your data? As long as you extract features with the testing flag for both your train and val splits, you should be fine.
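To illustrate why the testing flag matters for feature extraction, here is a hypothetical sketch of mode-dependent clip sampling: 'training' draws a random start (a form of augmentation), while 'testing' uses a fixed start so that every extraction run sees identical inputs. The function `sample_start` and the choice of start index 0 in testing mode are assumptions, not the repository's actual dataloader logic; the 1280-frame window matches the Fine-stream receptive field mentioned above.

```python
import random


def sample_start(num_video_frames: int, window: int, mode: str, rng=None) -> int:
    """Hypothetical helper: first original-frame index of a clip of `window` frames.

    'training' -> random start (varies between calls, acts as augmentation)
    'testing'  -> fixed start (assumed 0 here), so extracted features are reproducible
    """
    max_start = max(num_video_frames - window, 0)
    if mode == "training":
        rng = rng if rng is not None else random
        return rng.randint(0, max_start)
    if mode == "testing":
        return 0
    raise ValueError(f"unknown mode: {mode!r}")


# Extraction for every split uses the deterministic mode
for split in ("train", "val"):
    starts = [sample_start(2000, 1280, "testing") for _ in range(3)]
    print(split, starts)  # identical starts on every call: [0, 0, 0]
```

Because the testing path never calls the random generator, re-running extraction on the train split yields the same features, which is what the two-stream training step expects to load.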

Jul 07 '23 16:07 kkahatapitiya
