moment_detr
Question about the video encoder ViT
Hi, thanks for your great work! I have a question: how do you fuse the image features from a 2-second clip into a single clip-level video feature, given that ViT is a feature extraction model for images, not videos?
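We sample one video frame (an image) every 2 seconds and extract an embedding for it. So there is no fusion across frames: each 2-second clip is represented by the embedding of that single frame.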
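For anyone who wants to reproduce the sampling step, here is a minimal sketch of how the frame indices could be chosen. It is not taken from the repo: the function name `sample_frame_indices` and its parameters are hypothetical, and it assumes the first frame of each 2-second window is the one kept (the reply above does not say which frame within the window is used).

```python
from typing import List


def sample_frame_indices(num_frames: int, fps: float, clip_len_s: float = 2.0) -> List[int]:
    """Return one frame index per clip_len_s-second window of a video.

    Assumption: the frame at the start of each window is selected.
    """
    if num_frames <= 0 or fps <= 0 or clip_len_s <= 0:
        return []
    step = fps * clip_len_s  # number of frames spanned by one clip
    indices = []
    k = 0
    while True:
        idx = int(round(k * step))  # nearest frame to the k-th window start
        if idx >= num_frames:
            break
        indices.append(idx)
        k += 1
    return indices


if __name__ == "__main__":
    # A 10-second video at 30 fps has 300 frames.
    print(sample_frame_indices(num_frames=300, fps=30.0))
```

Each selected frame would then be passed through the image encoder on its own, giving one embedding per 2-second clip.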