Explain windowing/slicing in extract_s3d.py
Hi,
Thanks for creating this repo, it's really helpful!
Currently, I would like to use s3d to get features for each frame by setting step_size=1 and stack_size=20. When looking at the code in models/s3d/extract_s3d.py, I wasn't sure how the temporal window is determined, as there is no ...timestamps_ms.npy output file as there is in the i3d code.
Looking at the code below, it appears that for a given sample, the window is forward-looking. E.g. for sample 0, the features would be determined via the window [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], and for sample 5 it would be the window [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. Is this correct? It seems a bit counter-intuitive that the window is forward-looking; a backward-looking window, or one centred around the sample, would make more sense to me.
https://github.com/v-iashin/video_features/blob/a2f61b7a4cf0ca6a2d91dcc2182f57e7cfd12664/models/s3d/extract_s3d.py#L60-L69
https://github.com/v-iashin/video_features/blob/a2f61b7a4cf0ca6a2d91dcc2182f57e7cfd12664/utils/utils.py#L62-L71
Appreciate the help! Akseli
Think of it as following the logic of how a conv block traverses an input (without padding).
So a video of 100 frames will give you 81 features for a window size of 20 frames, stepping 1 frame each time:
>>> form_slices(size=100, stack_size=20, step_size=1)
[(0, 20), (1, 21), (2, 22), (3, 23), (4, 24), (5, 25), (6, 26), (7, 27), (8, 28), (9, 29), (10, 30), (11, 31), (12, 32), (13, 33), (14, 34), (15, 35), (16, 36), (17, 37), (18, 38), (19, 39), (20, 40), (21, 41), (22, 42), (23, 43), (24, 44), (25, 45), (26, 46), (27, 47), (28, 48), (29, 49), (30, 50), (31, 51), (32, 52), (33, 53), (34, 54), (35, 55), (36, 56), (37, 57), (38, 58), (39, 59), (40, 60), (41, 61), (42, 62), (43, 63), (44, 64), (45, 65), (46, 66), (47, 67), (48, 68), (49, 69), (50, 70), (51, 71), (52, 72), (53, 73), (54, 74), (55, 75), (56, 76), (57, 77), (58, 78), (59, 79), (60, 80), (61, 81), (62, 82), (63, 83), (64, 84), (65, 85), (66, 86), (67, 87), (68, 88), (69, 89), (70, 90), (71, 91), (72, 92), (73, 93), (74, 94), (75, 95), (76, 96), (77, 97), (78, 98), (79, 99), (80, 100)]
>>> len(form_slices(100, 20, 1))
81
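For reference, the slicing logic can be sketched like this (a minimal reimplementation that is consistent with the output above, not necessarily the repo's exact code):

```python
def form_slices(size, stack_size, step_size):
    # Slide a window of `stack_size` frames over a sequence of `size` frames,
    # advancing `step_size` frames each time. Windows that would run past the
    # end are simply dropped (no padding), like an unpadded convolution.
    slices = []
    for start in range(0, size - stack_size + 1, step_size):
        slices.append((start, start + stack_size))
    return slices

print(len(form_slices(100, 20, 1)))   # 81
print(form_slices(100, 20, 1)[0])     # (0, 20)
print(form_slices(100, 20, 1)[-1])    # (80, 100)
```

Each `(start, end)` pair is a half-open slice, so the last window `(80, 100)` covers frames 80..99 and still fits inside the 100-frame video.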
You can also see this by passing the show_pred=True flag to the model.
I do admit that centring around a sample would make sense if you'd like to have this 'timestamp-feature correspondence'. However, I believe implementing it that way might confuse others, not to mention the 'boundary effect': what should the lib do for the first 0--stack_size//2 timestamps, and for those at the other end of the sequence?
Additionally, one could get the 'backward-looking' behaviour by treating the output features as corresponding to timestamps 20, 21, ..., 100, and the 'centred' behaviour by using timestamps 10, 11, ..., 90. So it's just a matter of mapping the outputs to what a user wants, without complicating the logic.
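To make that mapping concrete, here is a small sketch (my own illustration, not code from the repo) of relabelling the same 81 features with forward-looking, backward-looking, or centred timestamps:

```python
stack_size = 20
# slices produced for a 100-frame video with step_size=1: (0, 20), (1, 21), ..., (80, 100)
slices = [(start, start + stack_size) for start in range(81)]

forward = [s for s, e in slices]                     # label each feature with its first frame: 0..80
backward = [e for s, e in slices]                    # label with the frame just past its window: 20..100
centred = [s + stack_size // 2 for s, e in slices]   # label with its middle frame: 10..90

print(backward[0], backward[-1])  # 20 100
print(centred[0], centred[-1])    # 10 90
```

No change to the extractor is needed; only the interpretation of which timestamp each feature belongs to changes.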
Thank you for your question, though!
Thanks for the quick reply! Yes, I think this makes sense.
So, if I want 100 features for 100 frames, I would need to add padding/additional video at the end of my input? I.e., if I input a video with frames [0, 1, 2, ..., 119], I would receive feature vectors [0, 1, 2, ..., 99] for the first 100 frames of my video?
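A quick arithmetic check on my side, reusing the unpadded-window rule from above (num_windows is just a hypothetical helper for this check, not something from the repo):

```python
def num_windows(size, stack_size=20, step_size=1):
    # count of valid unpadded windows: starts are 0, step_size, 2*step_size, ...
    # and the last start must satisfy start + stack_size <= size
    return max(0, (size - stack_size) // step_size + 1)

print(num_windows(100))  # 81  (matches the example above)
print(num_windows(119))  # 100 -> frames [0..118] already give 100 features
print(num_windows(120))  # 101 -> frames [0..119] give one extra
```

So if I understand correctly, [0..119] would be more than enough, and the first 100 features would be the ones for frames 0..99.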
Best, Akseli
And would you have any recommendations on how one should do padding here? Add black frames at the end, or copy the last frame 20 times?