retrieve_text demo for internvideo2 multi_modality

Open shyern opened this issue 2 months ago • 0 comments

- Set model.eval() in the demo code.
- Set the correct model path in internvideo2_stage2_config.py in two places: model.vision_encoder.pretrained and pretrained_path.
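One way to rule out a wrong path is to confirm that both configured values point to existing files before running the demo. The snippet below is only a minimal sketch: the helper name `find_missing_checkpoints` and the placeholder paths are hypothetical and are not part of the InternVideo2 code base, so substitute the values from your own internvideo2_stage2_config.py.

```python
import os


def find_missing_checkpoints(paths):
    """Return the config keys whose value is not an existing file."""
    return [key for key, path in paths.items() if not os.path.isfile(path)]


# Hypothetical placeholder values; replace them with the paths in your config.
paths = {
    "model.vision_encoder.pretrained": "/path/to/vision_encoder_checkpoint",
    "pretrained_path": "/path/to/stage2_checkpoint",
}

for key in find_missing_checkpoints(paths):
    print(f"{key} does not point to an existing file: {paths[key]}")
```

If nothing is printed, both entries resolve to files on disk. This does not prove that the files are the right checkpoints, only that the paths are not dangling.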

I have made the changes based on the advice and obtained the demo results shown below. However, I’m not sure if they are correct.

When I test the model with a video of a person riding a bicycle and several text descriptions, such as ['a person running in a campus walkway.', 'a van or vehicle driving in the campus pedestrian area.', 'a person is riding a bicycle.'], the retrieval results seem poor: the correct sentence ('a person is riding a bicycle.') gets a lower similarity score than the incorrect ones. I’ve been confused about this issue for several days. This is one frame of the video.
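For anyone comparing numbers, here is a rough sketch of how text retrieval scores are commonly computed from embeddings: L2-normalize the video and text features, take dot products (cosine similarity), scale, and apply a softmax. This is not the InternVideo2 demo code. The function names, the toy 3-dimensional vectors, and the scale factor of 100 are assumptions made for illustration, and real features would come from the model's vision and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_texts(video_feat, text_feats, scale=100.0):
    """Return cosine similarities, softmax probabilities, and ranked indices."""
    v = l2_normalize(video_feat)
    sims = [sum(a * b for a, b in zip(v, l2_normalize(t))) for t in text_feats]
    # Assumed CLIP-style scaling before the softmax; subtract the max for stability.
    logits = [scale * s for s in sims]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sims, probs, order


# Toy stand-in features, not real model outputs.
video_feat = [0.9, 0.1, 0.0]
text_feats = [
    [0.1, 0.9, 0.0],  # stands in for the "running" caption
    [0.0, 0.2, 0.9],  # stands in for the "van" caption
    [1.0, 0.0, 0.1],  # stands in for the "bicycle" caption
]

sims, probs, order = rank_texts(video_feat, text_feats)
for i in order:
    print(f"text {i}: cosine={sims[i]:.3f} prob={probs[i]:.3f}")
```

Looking at the raw cosine similarities next to the softmax probabilities can help separate two cases: features that are genuinely uninformative (all similarities nearly equal, which may hint at weights that did not load) versus a ranking that is simply wrong.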

Could anyone please help me figure it out? Or could someone share the new WeChat group QR code?

Nov 09 '25 14:11 shyern