InternVideo similarity scores are very low between the video and text features

Hi @leexinhao, I am trying text-to-video retrieval on my dataset using https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/demo_video_text_retrieval.ipynb, but the similarity scores between text_features and video_features are very low. I am using the weight file InternVideo2-stage2_1b-224p-f4.pt, and I compute the cosine similarity as the dot product of the two feature matrices (text_features @ video_features.T). The scores I get are:

array([0.08640765, 0.08618326, 0.08596011, 0.08578135, 0.08574679, 0.08564241, 0.08557957, 0.08552065, 0.08551717, 0.08548111], dtype=float32)

Thanks.

Feb 20 '25 09:02 rishabh-akridata
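
A note on the computation described above: the dot product text_features @ video_features.T equals the cosine similarity only when both feature matrices are L2-normalized row by row. The thread does not say whether the notebook's features are already normalized, so the following is a minimal sketch that normalizes explicitly. The function name cosine_similarity_matrix and the random inputs are illustrative assumptions, not part of the InternVideo code.

```python
import numpy as np

def cosine_similarity_matrix(text_features, video_features, eps=1e-12):
    """Return the (num_texts, num_videos) cosine-similarity matrix."""
    text = np.asarray(text_features, dtype=np.float32)
    video = np.asarray(video_features, dtype=np.float32)
    # L2-normalize each row so that the dot product equals the cosine similarity.
    text = text / np.maximum(np.linalg.norm(text, axis=-1, keepdims=True), eps)
    video = video / np.maximum(np.linalg.norm(video, axis=-1, keepdims=True), eps)
    return text @ video.T

# Stand-in features (random, for illustration only): 2 texts, 5 videos, dim 8.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(2, 8))
video_features = rng.normal(size=(5, 8))

sims = cosine_similarity_matrix(text_features, video_features)
print(sims.shape)
```

If the scores stay nearly identical across all candidates after normalization, as in the array above, the features themselves are not discriminating between videos, and the way the model was loaded is worth checking.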

I solved this bug. The cause was that the model weights were not loaded correctly. Make sure that "pretrained_path" in internvideo2_stage2_config.py points to your weight file. This step is not mentioned in the repo's DEMO_USAGE_GUIDE.

Mar 01 '25 02:03 UnableToUseGit
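
To catch this kind of problem early, one option is to validate the checkpoint path before building the model. The sketch below uses only the Python standard library. The helper check_pretrained_path is hypothetical and not part of the InternVideo repo; it assumes "pretrained_path" is a plain filesystem path to a .pt file.

```python
import os
import tempfile

def check_pretrained_path(pretrained_path):
    """Return the expanded path, or raise if it is not an existing, non-empty file."""
    if not pretrained_path:
        raise ValueError("pretrained_path is empty")
    path = os.path.expanduser(pretrained_path)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"checkpoint not found: {path}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"checkpoint file is empty: {path}")
    return path

# Usage with a throwaway file standing in for the real weight file.
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_ckpt = os.path.join(tmp_dir, "weights.pt")
    with open(fake_ckpt, "wb") as f:
        f.write(b"placeholder")
    checked = check_pretrained_path(fake_ckpt)
    print(checked == fake_ckpt)
```

A check like this only confirms that the file exists. It does not confirm that the weights inside match the model, so it is still worth reading any missing-key or unexpected-key messages printed while the checkpoint loads.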

Thanks a lot!

Apr 24 '25 08:04 MxLearner