Explain windowing/slicing in extract_s3d.py

Open Akseli-Ilmanen opened this issue 9 months ago • 3 comments

Hi,

Thanks for creating this repo, it's really helpful!

Currently, I would like to use s3d to get features for each frame by setting step_size=1 and stack_size=20. When looking at the code in models/s3d/extract_s3d.py, I wasn't sure how the temporal window is determined, as there is no ...timestamps_ms.npy output file as in the i3d code.

Looking at the code below, it appears that for a given sample, the window is forward-looking. E.g. for sample 0, the features would be determined via the window [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], and for sample 5 it would be the window [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. Is this correct? It seems a bit counter-intuitive to me that the window is forward-looking: wouldn't a backward-looking window, or one centered around the sample, make more sense?

https://github.com/v-iashin/video_features/blob/a2f61b7a4cf0ca6a2d91dcc2182f57e7cfd12664/models/s3d/extract_s3d.py#L60-L69

https://github.com/v-iashin/video_features/blob/a2f61b7a4cf0ca6a2d91dcc2182f57e7cfd12664/utils/utils.py#L62-L71

Appreciate the help! Akseli

Akseli-Ilmanen, Jul 08 '25 11:07

Think of it as following the logic of how a conv block traverses an input (without padding).

So a video of 100 frames will give you 81 features for a window size of 20 frames, stepping 1 frame at a time:

>>> form_slices(size=100, stack_size=20, step_size=1)
[(0, 20), (1, 21), (2, 22), (3, 23), (4, 24), (5, 25), (6, 26), (7, 27), (8, 28), (9, 29), (10, 30), (11, 31), (12, 32), (13, 33), (14, 34), (15, 35), (16, 36), (17, 37), (18, 38), (19, 39), (20, 40), (21, 41), (22, 42), (23, 43), (24, 44), (25, 45), (26, 46), (27, 47), (28, 48), (29, 49), (30, 50), (31, 51), (32, 52), (33, 53), (34, 54), (35, 55), (36, 56), (37, 57), (38, 58), (39, 59), (40, 60), (41, 61), (42, 62), (43, 63), (44, 64), (45, 65), (46, 66), (47, 67), (48, 68), (49, 69), (50, 70), (51, 71), (52, 72), (53, 73), (54, 74), (55, 75), (56, 76), (57, 77), (58, 78), (59, 79), (60, 80), (61, 81), (62, 82), (63, 83), (64, 84), (65, 85), (66, 86), (67, 87), (68, 88), (69, 89), (70, 90), (71, 91), (72, 92), (73, 93), (74, 94), (75, 95), (76, 96), (77, 97), (78, 98), (79, 99), (80, 100)]
>>> len(form_slices(100, 20, 1))
81

You can also see this by passing the show_pred=True flag to the model.
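The slicing shown above can be reproduced in a few lines of Python. This is only a minimal sketch that matches the output in this thread; the name `form_slices_sketch` is hypothetical, and it is not claimed to be the exact code in utils/utils.py. It assumes no padding, so trailing frames that cannot fill a whole window are dropped.

```python
def form_slices_sketch(size, stack_size, step_size):
    """Return (start, end) index pairs, with end exclusive, for sliding windows.

    Hypothetical re-implementation for illustration, consistent with the
    output shown in this thread; not copied from the repository.
    """
    if stack_size <= 0 or step_size <= 0:
        raise ValueError("stack_size and step_size must be positive")
    # Number of windows that fit entirely inside the sequence (no padding).
    num_windows = max(0, (size - stack_size) // step_size + 1)
    return [(i * step_size, i * step_size + stack_size) for i in range(num_windows)]


slices = form_slices_sketch(size=100, stack_size=20, step_size=1)
print(len(slices), slices[0], slices[-1])  # 81 (0, 20) (80, 100)
```

The count follows the same formula as the output length of an unpadded convolution: (size - stack_size) // step_size + 1.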

I do admit that centring around a sample would make sense if you'd like to have this 'timestamp-feature correspondence'. However, I believe implementing it differently might confuse others, not to mention the 'boundary effect': what should the lib do for timestamps 0 to stack_size//2 and for those at the other end of the sequence?

Additionally, one could get the 'backwards-facing' behaviour by treating the output features, in the order they come, as belonging to timestamps 20, 21, ..., 100, and the 'centring' behaviour by treating them as belonging to timestamps 10, 11, ..., 90. So it's just a matter of mapping the outputs to what a user wants without complicating the logic.
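The mapping described above could be sketched as follows. The helper `reference_frames` and its `mode` names are hypothetical and not part of the library; the sketch assumes the (start, end) pairs have an exclusive end, as in the output earlier in the thread.

```python
def reference_frames(slices, mode="forward"):
    """Map each (start, end) window to one frame index under a given convention.

    Hypothetical helper; end is treated as exclusive, following the
    (0, 20), (1, 21), ... output shown earlier in the thread.
    """
    if mode == "forward":
        return [start for start, _ in slices]
    if mode == "backward":
        return [end for _, end in slices]
    if mode == "centred":
        return [start + (end - start) // 2 for start, end in slices]
    raise ValueError(f"unknown mode: {mode}")


# Windows for a 100-frame video with stack_size=20, step_size=1.
slices = [(i, i + 20) for i in range(81)]
backward = reference_frames(slices, "backward")
centred = reference_frames(slices, "centred")
print(backward[0], backward[-1])  # 20 100
print(centred[0], centred[-1])    # 10 90
```

Because the end index is exclusive, the last frame actually inside a window is end - 1; subtract 1 from the 'backward' values if that is the index you want.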

Thank you for your question, though.

v-iashin, Jul 08 '25 11:07

Thanks for the quick reply! Yes, I think this makes sense.

So, if I want 100 features for 100 frames, I would need to add padding or additional video at the end of my video input? So, if I input a video with frames [0, 1, 2, ..., 119], I would receive features [0, 1, 2, ..., 99] for the first 100 frames of my video?

Best Akseli

Akseli-Ilmanen, Jul 08 '25 17:07

And would you have any recommendations on how one should do padding here? Add black frames at the end, or copy the last frame 20 times?

Akseli-Ilmanen, Jul 09 '25 05:07
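On the arithmetic in the last two comments: if windows are formed as in the 100-frame example (81 windows, no padding), then with stack_size=20 and step_size=1 a sequence needs 19 extra frames to yield one window per original frame, so 119 frames give 100 features and 120 frames would give 101. The thread does not recommend a padding strategy. The sketch below only illustrates the "repeat the last frame" option at the level of frame indices; `padded_frame_indices` is a hypothetical helper, and whether repeated or black frames produce better features is not settled here.

```python
def padded_frame_indices(num_frames, stack_size):
    """Frame indices to read so that step_size=1 yields one window per frame.

    Hypothetical helper: pads the end by repeating the last frame index
    stack_size - 1 times (one of the two options raised in the question).
    """
    if num_frames <= 0 or stack_size <= 0:
        raise ValueError("num_frames and stack_size must be positive")
    return list(range(num_frames)) + [num_frames - 1] * (stack_size - 1)


indices = padded_frame_indices(num_frames=100, stack_size=20)
# Forward-looking windows over the padded sequence, one per original frame.
windows = [indices[i:i + 20] for i in range(len(indices) - 20 + 1)]
print(len(indices), len(windows))  # 119 100
print(windows[-1][:3])             # [99, 99, 99]
```

Note that the last windows consist mostly of copies of one frame, so their features describe a nearly static clip; this is the 'boundary effect' mentioned earlier in the thread.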