InternVideo
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
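I want the model to output the coordinates of a given action in every frame. I tried prompting it to output coordinates, but it replied that it cannot output pixel coordinate values. I would like to know how this is implemented in the paper. Could anyone share some reference code or ideas? Thanks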
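Hello, I downloaded MSRVTT from the address given in the guide. It contains many test_list files, and I would like to ask which one is used. After extracting MSRVTT, the directory contains: annotation high-quality structured-symlinks videos. Which file in which folder is test_1k? Is it MSRVTT/structured-symlinks/val_list_jsfusion.txt?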
When I try to fine-tune stage 2 of InternVideo2 with num_frames 12, I get the error below: ```python [rank0]: File "/root/nginx/multi_modality/tasks/shared_utils.py", line 192, in setup_model [rank0]: msg = model_without_ddp.load_state_dict(state_dict, strict=False) [rank0]:...
Hello, I tried running the video text retrieval demo and I'm running into this error: ``` File "/home/saumya/miniconda3/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 481, in checkpoint return CheckpointFunction.apply(function, preserve, *args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/saumya/miniconda3/lib/python3.12/site-packages/torch/autograd/function.py", line...
Hi InternVideo2 team! Could you please share code showing how you extract the multi-modal features? I'd like to use the models to extract features from my own dataset. Thanks...
I would like to use InternVideo2.5 to extract video embeddings. Could you provide a reference script for extracting embeddings, specifically the `hidden_states[-1]` from the LLM's `hidden_states`? Thank you!
Thank you for this video model! I had one question. Is all the temporal modeling in InternVideo2.5 offloaded to the LLM? That is how it appears from the demo provided...
Hi, When I try to run sh eval_msrvtt.sh, I am getting the following error: ------------------------------------------------------ [rank0]: File "/workspace/InternVideo2/multi_modality/tasks/pretrain.py", line 315, in main [rank0]: train_loaders, test_name2loaders, train_media_types = setup_dataloaders( [rank0]: File...
Hi 👋 Thank you for your great work! I'd love to reproduce your results for my future research, but I'm having trouble downloading the VideoMAE feature from the Baidu link...
Such as temporal grounding on QVHighlights and Charades-STA