moment_detr

Question about the video encoder ViT

Open Summer-seu opened this issue 1 year ago • 1 comments

Hi, thanks for your great work! I have a question: how do you fuse the image features from a 2-second clip into a single clip-level video feature, given that ViT is a feature extraction model for images, not videos?

Summer-seu, Sep 16 '23 10:09

We sample one video frame (an image) every 2 seconds and extract an embedding for it, so each 2-second clip is represented by that single frame's embedding.

jayleicn, Oct 04 '23 18:10
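
For readers who want to see the sampling described in the reply in concrete terms, here is a minimal sketch. It is not the repository's actual extraction code. The function names `sample_frame_indices` and `extract_clip_features` are hypothetical, picking the first frame of each 2-second window is an assumption (the reply does not say which frame is used), and `toy_encoder` is a placeholder standing in for a real image encoder such as a ViT.

```python
import math
import numpy as np


def sample_frame_indices(num_frames, fps, clip_len_s=2.0):
    """Return one frame index per clip_len_s-second window.

    Assumption: the first frame of each window is taken; the thread
    does not specify which frame inside the window is sampled.
    """
    step = fps * clip_len_s                  # frames per window
    n_clips = math.ceil(num_frames / step)   # last window may be partial
    # Clamp so a rounded index never runs past the final frame.
    return [min(int(round(i * step)), num_frames - 1) for i in range(n_clips)]


def extract_clip_features(frames, encode_image, fps, clip_len_s=2.0):
    """Encode one sampled frame per window; returns shape (n_clips, D).

    No fusion across frames happens here: each clip feature is simply
    the embedding of its single sampled frame.
    """
    indices = sample_frame_indices(len(frames), fps, clip_len_s)
    return np.stack([encode_image(frames[i]) for i in indices])


def toy_encoder(frame):
    """Placeholder encoder (mean colour, D=3); swap in a real image model."""
    return frame.reshape(-1, 3).mean(axis=0)


# A dummy 10-second video at 30 fps: 300 frames of 4x4 RGB.
video = np.zeros((300, 4, 4, 3), dtype=np.float32)
features = extract_clip_features(video, toy_encoder, fps=30.0)
print(features.shape)  # one feature vector per 2-second clip
```

With this scheme the image encoder only ever sees single frames, which is why an image-only model can be used for video features without a separate temporal fusion step.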