
Training and Evaluation Code for ViClip

Open · fmthoker opened this issue 1 year ago · 11 comments

Dear authors, great work, and thanks for releasing the code for ViCLIP pretraining on InternVid-10M-FLT. Firstly, it would be really great if the pre-training instructions were more detailed, e.g. which CLIP model to start from, paths for the configs, etc. Secondly, could you please also release the evaluation code and scripts for evaluating the pretrained ViCLIP models on zero-shot Kinetics-400, SSv2, UCF, etc.? I want to reproduce the zero-shot evaluation numbers in my local setup.

Thanks and Regards

fmthoker avatar May 30 '24 06:05 fmthoker

Hi! For the zero-shot evaluation, you can refer to the VideoCLIP code in InternVideo2.

Andy1621 avatar May 31 '24 01:05 Andy1621
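For context, zero-shot action classification with a video-text CLIP model such as ViCLIP boils down to encoding one text prompt per class name, encoding each video, and picking the class with the highest cosine similarity. The following is only a minimal sketch of that recipe; the prompt template and the encode_text/encode_vision signatures are assumptions and may not match the exact ViCLIP API in this repo.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, video_tensor, class_names, device="cuda"):
    # Build one text prompt per class; the template is an assumption, not the repo's exact one.
    prompts = ["a video of a person " + c for c in class_names]
    text_feat = model.encode_text(prompts)                    # assumed signature; may need tokenization first
    vid_feat = model.encode_vision(video_tensor.to(device))   # return value differs per model class; treated as one pooled feature here
    text_feat = F.normalize(text_feat, dim=-1)
    vid_feat = F.normalize(vid_feat, dim=-1)
    sims = vid_feat @ text_feat.T                             # [num_videos, num_classes] cosine similarities
    return sims.argmax(dim=-1)                                # predicted class index per video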

@Andy1621 Thanks for the quick response. Are you referring to the scripts in InternVideo/InternVideo2/multi_modality/scripts/evaluation/clip/zero_shot? If so, it seems they are for evaluating the InternVideo2 CLIP. Would the scripts and code work off-the-shelf for the ViCLIP models that you have shared, or do we need to make any changes? It would also be great if you could share the eval code for ViCLIP directly. Thanks in advance.

fmthoker avatar Jun 05 '24 07:06 fmthoker

Hi~ You can find the evaluation scripts here.

Andy1621 avatar Jun 05 '24 07:06 Andy1621

@Andy1621 Thanks for your quick response, I will try that to reproduce the results.

fmthoker avatar Jun 05 '24 08:06 fmthoker

@Andy1621 I tried to do zero-shot eval on MSRVTT-1k with the scripts from here. However, I am getting the following error:

Traceback (most recent call last):
  File "tasks/retrieval.py", line 15, in <module>
    from models.vindlu import VindLU
ModuleNotFoundError: No module named 'models.vindlu'

fmthoker avatar Jun 05 '24 19:06 fmthoker

I think it's a bug introduced when cleaning the code. You can fix it in tasks/retrieval.py by changing the imports to:

# from models.vindlu import VindLU
# from models.vindlu_vit import VindLU_VIT
# from models.vindlu_videoclip import VindLU_VideoCLIP
# from models.vindlu_blip_qformer import VindLU_BLIP_QFormer
from models.viclip import ViCLIP

And also change the model in config.py from VindLU_VideoCLIP to ViCLIP.

Andy1621 avatar Jun 06 '24 02:06 Andy1621
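For readers applying the same change, only the model class name in the evaluation config needs to switch; the snippet below is a hedged sketch, since the actual layout of config.py may differ:

# hedged sketch of the relevant part of the evaluation config; only model_cls changes
model = dict(
    model_cls="ViCLIP",  # previously "VindLU_VideoCLIP"
    # ...the remaining model keys (vision encoder, text encoder, ...) stay as they were
)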

@Andy1621 Thanks, that solves the problem. However, I think the code is still not complete, as I get the following error:

Traceback (most recent call last):
  File "tasks/retrieval.py", line 292, in <module>
    main(cfg)
  File "tasks/retrieval.py", line 208, in main
    res = evaluation_wrapper(
  File "/ibex/ai/home/thokerfm/anaconda3/envs/viclip/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 85, in evaluation_wrapper
    i2t_x, t2i_x, i2t_emb, t2i_emb = evaluation(
  File "/ibex/ai/home/thokerfm/anaconda3/envs/viclip/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 132, in evaluation
    image_feats, pooled_image_feats = extract_vision_feats(
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 54, in extract_vision_feats
    image_feat, pooled_image_feat = model.encode_vision(image, test=True)
ValueError: too many values to unpack (expected 2)

fmthoker avatar Jun 06 '24 06:06 fmthoker
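As a quick diagnostic for this kind of unpacking error (not the fix eventually adopted later in the thread), printing what encode_vision actually returns shows why the two-variable unpack fails for this model class:

# generic inspection snippet; only for debugging the return value of encode_vision
out = model.encode_vision(image, test=True)
if isinstance(out, (tuple, list)):
    print("encode_vision returned", len(out), "values")
else:
    print("encode_vision returned a single tensor of shape", tuple(out.shape))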

> @Andy1621 Thanks, that solves the problem. However, I think the code is still not complete, as I get the following error:
> [same traceback as above]

Did you solve this problem? I got the same error.

Code-kunkun avatar Jun 23 '24 12:06 Code-kunkun

@Code-kunkun Yes, you need to change line 79 in tasks/retrieval_utils.py
(https://github.com/OpenGVLab/InternVideo/blob/10183826112bd7edd983b68b6d7a5faa5d370709/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py#L79)
to

if config.model.model_cls == "VindLU_VideoCLIP" or config.model.model_cls == "ViCLIP"

Let me know if that works.

fmthoker avatar Jun 23 '24 13:06 fmthoker

> @Code-kunkun Yes, you need to change line 79 in tasks/retrieval_utils.py
> (https://github.com/OpenGVLab/InternVideo/blob/10183826112bd7edd983b68b6d7a5faa5d370709/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py#L79)
> to if config.model.model_cls == "VindLU_VideoCLIP" or config.model.model_cls == "ViCLIP". Let me know if that works.

Thanks for your quick reply! It works🥳.

Code-kunkun avatar Jun 23 '24 13:06 Code-kunkun

@Andy1621 Thanks for your help so far with the zero-shot evaluation. Could you please point me to the scripts/code to use for full fine-tuning of the ViCLIP models? Also, how do we run full fine-tuning on action classification datasets like SSv2 and Kinetics with the current codebase?

fmthoker avatar Jun 30 '24 07:06 fmthoker