About feature extraction from raw video using InternVideo2

Dotori-HJ opened this issue Sep 19 '24

Thank you for the great work!

I am currently working on temporal action localization and plan to use InternVideo2-1B and -6B to extract features from raw video data that is not available on Hugging Face. However, I am unclear on the exact feature extraction process.

Could you please provide guidance or an example on how to extract features from raw video using InternVideo2?

Dotori-HJ avatar Sep 19 '24 14:09 Dotori-HJ

Hi,

Just following up on this question.

CrazyGeG avatar Oct 01 '24 14:10 CrazyGeG

Following this question!

arushirai1 avatar Oct 24 '24 16:10 arushirai1

@Dotori-HJ @CrazyGeG @arushirai1 Sorry for the late reply. I hope this finds you well.

For video feature extraction, you can refer to the script from another one of our projects: extract_tad_feature.py. You just need to switch the model from VideoMAEv2 to InternVideo2. You can find the pretrained model links and configuration details for InternVideo2 here. We uniformly sample 8 frames for each sliding window input to InternVideo2.
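As a rough illustration, a minimal sliding-window extraction loop might look like the sketch below. This is only a sketch, not the exact script: the `model` and `transform` arguments are placeholders for the InternVideo2 encoder and its preprocessing (see the links above for the real builders and configs).

```python
import numpy as np
import torch
from decord import VideoReader, cpu  # decord is what extract_tad_feature.py uses

@torch.no_grad()
def extract_features(video_path, model, transform, clip_len=8, stride=16, device="cuda"):
    """Slide a window over the video; encode `clip_len` uniformly sampled frames per window."""
    vr = VideoReader(video_path, ctx=cpu(0))
    feats = []
    for start in range(0, len(vr), stride):
        end = min(start + stride, len(vr)) - 1
        idx = np.linspace(start, end, clip_len).astype(int)     # uniform sampling
        frames = torch.from_numpy(vr.get_batch(idx).asnumpy())  # (T, H, W, C), uint8
        clip = transform(frames).unsqueeze(0).to(device)        # (1, C, T, H, W)
        feats.append(model(clip).cpu())                         # one feature per window
    return torch.cat(feats, dim=0)                              # (num_windows, feat_dim)
```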

For query feature extraction, we use the last hidden state of chinese_alpaca_lora_7b.
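For illustration, extracting the last hidden state with HuggingFace transformers might look roughly like this; the checkpoint path is a placeholder, and chinese_alpaca_lora_7b in practice means LLaMA weights merged with the Chinese-Alpaca LoRA adapter:

```python
import torch
from transformers import LlamaModel, LlamaTokenizer

# Placeholder path: merged chinese_alpaca_lora_7b weights (an assumption,
# not an official checkpoint name on the Hub).
MODEL_PATH = "path/to/chinese_alpaca_lora_7b"

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaModel.from_pretrained(MODEL_PATH).eval()

@torch.no_grad()
def extract_query_feature(query: str) -> torch.Tensor:
    inputs = tokenizer(query, return_tensors="pt")
    # last_hidden_state: (1, seq_len, hidden_size) -- one vector per token
    return model(**inputs).last_hidden_state.squeeze(0)
```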

yinanhe avatar Oct 25 '24 02:10 yinanhe

@yinanhe Thank you for replying. I have a question about the normalization process for the model.

It seems normalization is applied during training using mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]. <build.py in InternVideo2> https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/single_modality/datasets/build.py#L14-L36

However, during feature extraction, it seems the input video is only scaled to the 0 ~ 1 range. <extract_tad_feature.py in VideoMAEv2> https://github.com/OpenGVLab/VideoMAEv2/blob/29eab1e8a588d1b3ec0cdec7b03a86cca491b74b/extract_tad_feature.py#L16-L17

```python
def to_normalized_float_tensor(vid):
    return vid.permute(3, 0, 1, 2).to(torch.float32) / 255
```

Could you clarify why there’s a difference in the normalization process between training and feature extraction, and whether this discrepancy affects the extracted features?

Dotori-HJ avatar Oct 25 '24 08:10 Dotori-HJ

@Dotori-HJ Sorry, my earlier reply was not rigorous enough and caused you trouble. The data transform still needs to follow the transform process of InternVideo2-CLIP. For details, you can refer to https://github.com/OpenGVLab/InternVideo/blob/eca2cdc5a67d7442063d19963515b5bd0feef627/InternVideo2/multi_modality/dataset/__init__.py#L133-L154
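For reference, the eval-time transform amounts to scaling to [0, 1] first and then applying mean/std normalization, i.e. the two steps you found are combined rather than alternatives. A minimal sketch (the resize size and statistics here are the commonly used values; confirm them against the linked code):

```python
import torch
import torch.nn.functional as F

MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1, 1)

def clip_transform(frames, size=224):
    """(T, H, W, C) uint8 -> (C, T, size, size) normalized float."""
    clip = frames.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W), scaled to [0, 1]
    clip = F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
    clip = clip.permute(1, 0, 2, 3)                    # (C, T, H, W)
    return (clip - MEAN) / STD                         # mean/std normalization
```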

yinanhe avatar Oct 25 '24 08:10 yinanhe

Thank you for the clarification and guidance. It has been very helpful!

Dotori-HJ avatar Oct 26 '24 03:10 Dotori-HJ

Hi, have you successfully extracted the video features? I still find the model configuration confusing, and I failed to load the weights. Could you share the relevant code with me? Thank you so much!

keqizero avatar Dec 08 '24 14:12 keqizero

Which files did you use for the video and text feature extraction? I also want to extract features from custom data (OVIS and customized queries), so I don't actually need the datasets listed under Datasets. But without downloading them, running bash scripts/pretraining/clip/1B/run.sh or bash scripts/pretraining/stage2/1B/run.sh may fail.

Shuaicong97 avatar Mar 17 '25 14:03 Shuaicong97

Have you solved it? @keqizero

Shuaicong97 avatar Mar 17 '25 14:03 Shuaicong97

Have you solved it? @keqizero

No, I haven't

keqizero avatar Mar 19 '25 11:03 keqizero

@yinanhe Hello, the approach mentioned in the link only yields a feature for the whole video, with shape [#patches, C]. How can I get features for all clips in a video? For example, for a 150 s video with fps = 0.5, the number of clips is 75. How can I use InternVideo2-CLIP to get a video feature of shape [75, 768], like the official features on Hugging Face? Looking forward to your reply.
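In other words, what I want is something like this (a sketch; encode_clip is a placeholder for whatever InternVideo2-CLIP call yields one 768-d feature per clip):

```python
import numpy as np

def encode_clip(clip_index):
    # Placeholder: in practice, sample this clip's frames and run the
    # InternVideo2-CLIP vision encoder to get one 768-d feature.
    return np.zeros(768, dtype=np.float32)

duration, fps = 150.0, 0.5              # 150 s video, one clip every 1/fps = 2 s
num_clips = int(duration * fps)         # 75 clips
features = np.stack([encode_clip(i) for i in range(num_clips)])
assert features.shape == (75, 768)      # matches the official Hugging Face features
```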

chaohongguo avatar Apr 27 '25 03:04 chaohongguo

@yinanhe Hello, the approach mentioned in the link only yields a feature for the whole video, with shape [#patches, C]. How can I get features for all clips in a video? For example, for a 150 s video with fps = 0.5, the number of clips is 75. How can I use InternVideo2-CLIP to get a video feature of shape [75, 768], like the official features on Hugging Face? Looking forward to your reply.

same question

zouyuda220 avatar Sep 21 '25 08:09 zouyuda220

Hi, have you successfully extracted the video features? I still find the model configuration confusing, and I failed to load the weights. Could you share the relevant code with me? Thank you so much!

same question.

tkasarla avatar Sep 25 '25 11:09 tkasarla

I have extracted the video features, which look good, but the text features, although they have the right shape, are not as good as the provided features.

zouyuda220 avatar Sep 29 '25 03:09 zouyuda220