About feature extraction from raw video using InternVideo2

Dotori-HJ opened this issue Sep 19 '24

Thank you for the great work!

I am currently working on temporal action localization and plan to use InternVideo2-1B and -6B to extract features from raw video data that is not available on Hugging Face. However, I am unclear on the exact feature extraction process.

Could you please provide guidance or an example on how to extract features from raw video using InternVideo2?

Dotori-HJ avatar Sep 19 '24 14:09 Dotori-HJ

Hi,

Just following up on this question.

CrazyGeG avatar Oct 01 '24 14:10 CrazyGeG

Following this question!

arushirai1 avatar Oct 24 '24 16:10 arushirai1

@Dotori-HJ @CrazyGeG @arushirai1 Sorry for the late reply. I hope this finds you well.

For video feature extraction, you can refer to the script from another one of our projects: extract_tad_feature.py. You just need to switch the model from VideoMAEv2 to InternVideo2. You can find the pretrained model links and configuration details for InternVideo2 here. We uniformly sample 8 frames for each sliding window input to InternVideo2.
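As a rough illustration, a minimal sliding-window extraction loop might look like the sketch below. This is only a sketch, not the exact script: the `model` and `transform` arguments are placeholders for the InternVideo2 encoder and its preprocessing (see the links above for the real builders and configs).

```python
import numpy as np
import torch
from decord import VideoReader, cpu  # decord is what extract_tad_feature.py uses

@torch.no_grad()
def extract_features(video_path, model, transform, clip_len=8, stride=16, device="cuda"):
    """Slide a window over the video; encode `clip_len` uniformly sampled frames per window."""
    vr = VideoReader(video_path, ctx=cpu(0))
    feats = []
    for start in range(0, len(vr), stride):
        end = min(start + stride, len(vr)) - 1
        idx = np.linspace(start, end, clip_len).astype(int)     # uniform sampling
        frames = torch.from_numpy(vr.get_batch(idx).asnumpy())  # (T, H, W, C), uint8
        clip = transform(frames).unsqueeze(0).to(device)        # (1, C, T, H, W)
        feats.append(model(clip).cpu())                         # one feature per window
    return torch.cat(feats, dim=0)                              # (num_windows, feat_dim)
```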

For query feature extraction, we use the last hidden state of chinese_alpaca_lora_7b.
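For illustration, extracting the last hidden state with HuggingFace transformers might look roughly like this; the checkpoint path is a placeholder, and chinese_alpaca_lora_7b in practice means LLaMA weights merged with the Chinese-Alpaca LoRA adapter:

```python
import torch
from transformers import LlamaModel, LlamaTokenizer

# Placeholder path: merged chinese_alpaca_lora_7b weights (an assumption,
# not an official checkpoint name on the Hub).
MODEL_PATH = "path/to/chinese_alpaca_lora_7b"

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaModel.from_pretrained(MODEL_PATH).eval()

@torch.no_grad()
def extract_query_feature(query: str) -> torch.Tensor:
    inputs = tokenizer(query, return_tensors="pt")
    # last_hidden_state: (1, seq_len, hidden_size) -- one vector per token
    return model(**inputs).last_hidden_state.squeeze(0)
```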

yinanhe avatar Oct 25 '24 02:10 yinanhe

@yinanhe Thank you for replying. I have a question about the normalization process for the model.

It seems normalization is applied during training using mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]. <build.py in InternVideo2> https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/single_modality/datasets/build.py#L14-L36

However, during feature extraction, it seems the input video is only scaled to the 0 ~ 1 range. <extract_tad_feature.py in VideoMAEv2> https://github.com/OpenGVLab/VideoMAEv2/blob/29eab1e8a588d1b3ec0cdec7b03a86cca491b74b/extract_tad_feature.py#L16-L17

```python
def to_normalized_float_tensor(vid):
    return vid.permute(3, 0, 1, 2).to(torch.float32) / 255
```

Could you clarify why there’s a difference in the normalization process between training and feature extraction, and whether this discrepancy affects the extracted features?

Dotori-HJ avatar Oct 25 '24 08:10 Dotori-HJ

@Dotori-HJ Sorry, my earlier reply was not rigorous enough and caused you trouble. The data transform still needs to follow the transform process of InternVideo2-CLIP. For details, you can refer to https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/multi_modality/dataset/__init__.py#L133-L154
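For reference, the eval-time transform amounts to scaling to [0, 1] first and then applying mean/std normalization, i.e. the two steps you found are combined rather than alternatives. A minimal sketch (the resize size and statistics here are the commonly used values; confirm them against the linked code):

```python
import torch
import torch.nn.functional as F

MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1, 1)

def clip_transform(frames, size=224):
    """(T, H, W, C) uint8 -> (C, T, size, size) normalized float."""
    clip = frames.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W), scaled to [0, 1]
    clip = F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
    clip = clip.permute(1, 0, 2, 3)                    # (C, T, H, W)
    return (clip - MEAN) / STD                         # mean/std normalization
```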

yinanhe avatar Oct 25 '24 08:10 yinanhe

Thank you for the clarification and guidance. It has been very helpful!

Dotori-HJ avatar Oct 26 '24 03:10 Dotori-HJ

Hi, have you successfully extracted the video features? I still find the model configuration confusing, and I failed to load the weights. Could you share the relevant code with me? Thank you so much!

keqizero avatar Dec 08 '24 14:12 keqizero

Which files did you use for the video and text feature extraction? I also want to extract features from custom data (OVIS and customized queries), so I don't actually need the datasets listed under Datasets. But without downloading them, running bash scripts/pretraining/clip/1B/run.sh or bash scripts/pretraining/stage2/1B/run.sh may fail.

Shuaicong97 avatar Mar 17 '25 14:03 Shuaicong97

Have you solved it? @keqizero

Shuaicong97 avatar Mar 17 '25 14:03 Shuaicong97

Have you solved it? @keqizero

No, I haven't

keqizero avatar Mar 19 '25 11:03 keqizero

@yinanhe Hello, the approach mentioned in the link only yields a feature for the whole video, with shape [#patches, C]. How can I get features for all clips in a video? For example, for a 150 s video with fps = 0.5, the number of clips is 75. How can I use InternVideo2-CLIP to get a video feature of shape [75, 768], like the official features on Hugging Face? Looking forward to your reply.
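In other words, what I want is something like this (a sketch; encode_clip is a placeholder for whatever InternVideo2-CLIP call yields one 768-d feature per clip):

```python
import numpy as np

def encode_clip(clip_index):
    # Placeholder: in practice, sample this clip's frames and run the
    # InternVideo2-CLIP vision encoder to get one 768-d feature.
    return np.zeros(768, dtype=np.float32)

duration, fps = 150.0, 0.5              # 150 s video, one clip every 1/fps = 2 s
num_clips = int(duration * fps)         # 75 clips
features = np.stack([encode_clip(i) for i in range(num_clips)])
assert features.shape == (75, 768)      # matches the official Hugging Face features
```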

chaohongguo avatar Apr 27 '25 03:04 chaohongguo

@yinanhe Hello, the approach mentioned in the link only yields a feature for the whole video, with shape [#patches, C]. How can I get features for all clips in a video? For example, for a 150 s video with fps = 0.5, the number of clips is 75. How can I use InternVideo2-CLIP to get a video feature of shape [75, 768], like the official features on Hugging Face? Looking forward to your reply.

same question

zouyuda220 avatar Sep 21 '25 08:09 zouyuda220

Hi, have you successfully extracted the video features? I still find the model configuration confusing, and I failed to load the weights. Could you share the relevant code with me? Thank you so much!

same question.

tkasarla avatar Sep 25 '25 11:09 tkasarla

I have extracted the video features, which look good, but the text features, although they have the right shape, are not as good as the provided features.

zouyuda220 avatar Sep 29 '25 03:09 zouyuda220