About feature extraction from raw video using InternVideo2
Thank you for the great work!
I am currently working on temporal action localization and planning to use InternVideo2-1B and 6B for feature extraction from raw video data that is not available on Hugging Face. However, I am unclear about the exact feature extraction process.
Could you please provide guidance or an example on how to extract features from raw video using InternVideo2?
Hi,
Just following up on this question.
Following this question!
@Dotori-HJ @CrazyGeG @arushirai1 Sorry for the late reply. I hope this finds you well.
For video feature extraction, you can refer to the script from another one of our projects: extract_tad_feature.py. You just need to switch the model from VideoMAEv2 to InternVideo2. You can find the pretrained model links and configuration details for InternVideo2 here. We uniformly sample 8 frames for each sliding window input to InternVideo2.
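As a concrete illustration of that recipe, here is a minimal sketch (not the authors' exact script): decode the video, slide a fixed-length window over it, uniformly sample 8 frames per window, and encode each window with InternVideo2. The `model` interface, window length, and stride below are assumptions; the real values come from the InternVideo2 configs linked above.

```python
import torch

# Minimal sketch of sliding-window feature extraction in the spirit of
# VideoMAEv2's extract_tad_feature.py, with the model swapped for
# InternVideo2. `model` is assumed to be a loaded InternVideo2 vision
# encoder that returns one embedding per clip (an assumption).

def extract_window_features(frames, model, window=16, stride=16, num_sampled=8):
    """frames: float tensor [T, C, H, W], already resized and normalized."""
    feats = []
    for start in range(0, frames.shape[0] - window + 1, stride):
        clip = frames[start:start + window]                      # [window, C, H, W]
        # Uniformly sample 8 frames per sliding window, as described above.
        idx = torch.linspace(0, window - 1, num_sampled).long()
        clip = clip[idx].permute(1, 0, 2, 3).unsqueeze(0)        # [1, C, 8, H, W]
        with torch.no_grad():
            feats.append(model(clip).squeeze(0))
    return torch.stack(feats)                                    # [num_windows, feat_dim]
```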
For query feature extraction, we use the last hidden state of chinese_alpaca_lora_7b.
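A rough sketch of that query-feature step, assuming the chinese_alpaca_lora_7b LoRA weights have already been merged into a standalone Hugging Face checkpoint (the path below is hypothetical, and the LoRA merging step is not shown):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical path to a merged chinese_alpaca_lora_7b checkpoint.
model_path = "/path/to/chinese_alpaca_lora_7b_merged"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float16).eval()

query = "a person opens the fridge"  # example text query
inputs = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Last hidden state: one embedding per token, shape [1, seq_len, hidden_size].
query_feature = outputs.last_hidden_state
```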
@yinanhe Thank you for replying. I have a question about the normalization process for the model.
It seems normalization is applied during training using mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. <build.py in InternVideo2> https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/single_modality/datasets/build.py#L14-L36
However, during extracting features, it seems that the input video is normalized within the 0 ~ 1 range.
<extract_tad_feature.py in VideoMAEv2>
https://github.com/OpenGVLab/VideoMAEv2/blob/29eab1e8a588d1b3ec0cdec7b03a86cca491b74b/extract_tad_feature.py#L16-L17

```python
def to_normalized_float_tensor(vid):
    return vid.permute(3, 0, 1, 2).to(torch.float32) / 255
```
Could you clarify why there’s a difference in the normalization process between training and feature extraction, and whether this discrepancy affects the extracted features?
@Dotori-HJ Sorry, my reply was not rigorous enough and caused you trouble. During the data transform, you still need to follow the transform process of InternVideo2-CLIP. For details, you can refer to https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/multi_modality/dataset/__init__.py#L133-L154
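For reference, a minimal sketch of what such a CLIP-style eval transform looks like, using the ImageNet mean/std quoted earlier; the authoritative implementation is in the linked `__init__.py`:

```python
import torch
from torchvision import transforms

# Sketch of an InternVideo2-CLIP-style eval transform (an assumption based
# on the mean/std quoted above, not the verbatim repo code).
mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

def clip_eval_transform(vid, size=224):
    """vid: uint8 tensor [T, H, W, C] decoded from the raw video."""
    vid = vid.permute(0, 3, 1, 2).float() / 255.0                # [T, C, H, W] in 0-1
    vid = transforms.functional.resize(vid, size, antialias=True)
    vid = transforms.functional.center_crop(vid, size)
    vid = transforms.functional.normalize(vid, mean, std)        # per-channel normalization
    return vid.permute(1, 0, 2, 3)                               # [C, T, H, W]
```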
Thank you for the clarification and guidance. It has been very helpful!
Hi, have you successfully extracted the video features? I still find the model configuration confusing, and I failed to load the weights. Could you share the relevant code with me? Thank you so much!
Which files did you use for the video and text feature extraction? I also want to extract features from custom data (OVIS and customized queries), so I don't actually need the datasets listed in Datasets. But without downloading them, running bash scripts/pretraining/clip/1B/run.sh or bash scripts/pretraining/stage2/1B/run.sh might fail.
Have you solved it? @keqizero
@yinanhe Hello, using the approach mentioned in the link, I only get a feature for the whole video with shape [#patches, C]. How can I get features for all clips in a video? For example, for a 150 s video with fps=0.5, the number of clips is 75. How can I use InternVideo2-CLIP to get a video feature of shape [75, 768], like the official features on Hugging Face? Looking forward to your reply.
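One possible approach (a sketch, not an official recipe): split the video into non-overlapping clips at the target clip rate and encode each clip separately. In the sketch below, `model`, `extract_clip_features`, and the default `native_fps` are all hypothetical, and the frames are assumed to have already gone through the InternVideo2-CLIP transform discussed earlier in the thread.

```python
import torch

# Sketch: split the video into non-overlapping clips and encode each with an
# InternVideo2-CLIP vision encoder returning a pooled [1, C] embedding per
# clip (C = 768 for the features on Hugging Face). `frames` is a float
# tensor [T, C, H, W], already resized and normalized.

def extract_clip_features(frames, model, native_fps=30.0, clip_fps=0.5,
                          num_sampled=8):
    frames_per_clip = int(round(native_fps / clip_fps))  # e.g. 30 / 0.5 = 60
    feats = []
    for start in range(0, frames.shape[0] - frames_per_clip + 1, frames_per_clip):
        clip = frames[start:start + frames_per_clip]
        # Uniformly sample 8 frames per clip, matching the windowed recipe above.
        idx = torch.linspace(0, clip.shape[0] - 1, num_sampled).long()
        clip = clip[idx].permute(1, 0, 2, 3).unsqueeze(0)  # [1, C, 8, H, W]
        with torch.no_grad():
            feats.append(model(clip).squeeze(0))
    # For a 150 s video at clip_fps=0.5 this yields [75, 768].
    return torch.stack(feats)
```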
same question
> Hi, have you successfully extracted the video features? I still find the model configuration confusing, and I failed to load the weights. Could you share the relevant code with me? Thank you so much!
same question.
I have extracted the video features, which look good, but the text features, although they have the right shape, are not as good as the provided ones.