Feature extraction code in Google Drive (VideoMAE)
Hello :)
Thank you for your fast reply!
In the code "extract_maeVideo_embedding.py", the feature_level argument is set to "UTTERANCE". In this condition, the final VideoMAE embedding is averaged with np.mean over the spatial-temporal token axis.
The tensor shape changes as follows: input (1, 3, 16, 224, 224) -> embedding (1, 1568, 1024) -> averaged (1, 1024)
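For reference, here is a minimal sketch of the pooling step as I understand it. The variable names are mine, not taken from extract_maeVideo_embedding.py:

```python
import numpy as np

# Sketch of the UTTERANCE-level pooling described above; not the
# actual code from extract_maeVideo_embedding.py.
# 1568 tokens = 8 temporal positions * 14 * 14 spatial patches.
embedding = np.random.randn(1, 1568, 1024)

# Average over the spatial-temporal token axis to get one utterance feature.
utterance_feature = np.mean(embedding, axis=1)  # (1, 1568, 1024) -> (1, 1024)
print(utterance_feature.shape)                  # (1, 1024)
```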
Is this code operating as you intended? In the paper, I understood that only the features extracted from the local encoder are averaged.
Thanks!
Hello :)
Yes, this operation aligns with our design. In the actual implementation, the Temporal Encoder also applies an averaging operation.
We emphasize the averaging operation for the local encoder because it processes only a single facial image at a time: a video sample yields multiple frame-level features, which are then averaged into one.
In contrast, the Temporal Encoder takes all 16 frames as input at once and directly produces a single clip-level feature, so we did not explicitly emphasize its averaging step.
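To make the contrast concrete, here is a hedged sketch of the two pooling paths. The encoder functions below are hypothetical stand-ins with assumed output shapes, not the repository's implementation:

```python
import numpy as np

def local_encoder(face_image):
    """Hypothetical stand-in: one face image -> one 1024-d feature."""
    return np.random.randn(1024)

def temporal_encoder(clip):
    """Hypothetical stand-in: one 16-frame clip -> (1568, 1024) tokens."""
    return np.random.randn(1568, 1024)

frames = [np.zeros((3, 224, 224)) for _ in range(16)]  # 16 face crops from one video

# Local encoder path: one feature per frame, averaged over frames
# (the averaging the paper explicitly emphasizes).
local_feature = np.mean([local_encoder(f) for f in frames], axis=0)  # (1024,)

# Temporal Encoder path: the whole clip at once, then an average over
# tokens (the averaging done inside the implementation without emphasis).
clip = np.stack(frames, axis=1)                             # (3, 16, 224, 224)
temporal_feature = np.mean(temporal_encoder(clip), axis=0)  # (1024,)

print(local_feature.shape, temporal_feature.shape)  # (1024,) (1024,)
```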