Video-LLaMA
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hi, I tried to reproduce the stage-2 results from the code and observed the following:
0. Both runs use the three datasets declared in the code: cc_sbu_align, llava_instruct, and webvid_instruct.
1. With the pretrain_vicuna7b-v2.pth checkpoint provided in the repo, I can reproduce the normal vicuna7b_stage2 behavior; the cc_sbu_align loss converges to around 0.1.
2. With the pretrain-vicuna13b.pth checkpoint provided in the repo, the trained stage-2 model recognizes images and videos poorly and often gives irrelevant answers; the cc_sbu_align loss fluctuates around 0.7-0.9.
The two experiments differ only in the LLM and the checkpoint; all other hyperparameters are identical (see the sketch below). Are there any special tuning tricks for finetuning the 13B model?
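A minimal sketch of the two runs being compared, to make "only the LLM and the checkpoint differ" concrete. The config path and key names (model.llama_model, model.ckpt) are assumptions based on the repo's MiniGPT-4-style YAML configs, not verified against this exact version.

```python
# Sketch: the only intended difference between the two stage-2 runs.
from omegaconf import OmegaConf

cfg = OmegaConf.load("train_configs/visionbranch_stage2_finetune.yaml")  # assumed path

# Run 1: 7B LLM + 7B stage-1 checkpoint -> cc_sbu_align loss converges to ~0.1.
cfg.model.llama_model = "ckpt/vicuna-7b/"
cfg.model.ckpt = "ckpt/pretrain_vicuna7b-v2.pth"

# Run 2: 13B LLM + 13B stage-1 checkpoint, all other hyperparameters unchanged
# -> cc_sbu_align loss stays around 0.7-0.9 and answers are often off-topic.
# cfg.model.llama_model = "ckpt/vicuna-13b/"
# cfg.model.ckpt = "ckpt/pretrain-vicuna13b.pth"
```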
Are there any plans to add DeepSpeed support later? I see that the LLaMA parameters are currently frozen; when I tried unfreezing them for training, even a batch size of 1 would not fit on an A100 (see the sketch below).
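A short sketch of what "frozen vs. unfrozen LLaMA" means for memory here. The attribute name model.llama_model follows the MiniGPT-4-style layout the repo appears to use and is an assumption, not verified code from this repo.

```python
# Sketch: toggling whether the LLM weights receive gradients.
def set_llama_trainable(model, trainable: bool) -> None:
    # Stage-1/2 training keeps the LLM frozen. Unfreezing a 13B LLM adds
    # gradients plus optimizer states for ~13B parameters, which is why even
    # batch size 1 overflows a single A100 without ZeRO/offload or
    # parameter-efficient tuning (e.g. LoRA).
    for p in model.llama_model.parameters():
        p.requires_grad_(trainable)

def count_trainable(model) -> int:
    # Useful to confirm how many parameters the optimizer will actually hold state for.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```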
I'm not sure whether some of the models were downloaded correctly.
Hi, I want to generate instruction data for my own dataset with GPT-4, but I don't know how to write the code. I also notice that there is a rate limit... (see the sketch below)
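A minimal sketch of one way to do this, assuming the openai>=1.0 Python client and a simple exponential backoff around the rate limit. The prompt wording, file names, and JSONL fields are placeholders for illustration, not the format the Video-LLaMA authors used.

```python
# Sketch: turn captions into GPT-4 instruction data with retry/backoff.
import json
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4(caption: str, max_retries: int = 5) -> str:
    prompt = f"Write an instruction-answer pair about this video caption:\n{caption}"
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception:
            # Back off exponentially on rate-limit or transient errors.
            time.sleep(2 ** attempt)
    raise RuntimeError("GPT-4 request kept failing after retries")

# Hypothetical input/output files: one JSON object with a "caption" field per line.
with open("my_captions.jsonl") as f, open("instruct_data.jsonl", "w") as out:
    for line in f:
        caption = json.loads(line)["caption"]
        record = {"caption": caption, "instruction": ask_gpt4(caption)}
        out.write(json.dumps(record) + "\n")
```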
Will the performance be worse?
Please check the API endpoints; they appear to be having issues.
Hi, I have a question about the audio input. In Video-LLaMA/video_llama/conversation/conversation_video.py, line 255, I think the input to this function (load_and_transform_audio_data) should be an audio file (.wav), so why is your input...
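For context, a hedged illustration of feeding a .wav to the ImageBind loader by first extracting the audio track from the video. The moviepy usage, file paths, and the import path for the vendored ImageBind data module are assumptions for illustration, not the exact code path taken in conversation_video.py.

```python
# Sketch: extract a .wav from a video, then hand it to ImageBind's audio loader.
from moviepy.editor import VideoFileClip
from video_llama.models.ImageBind.data import load_and_transform_audio_data  # assumed import path

video_path = "examples/sample_video.mp4"  # hypothetical input file
audio_path = "examples/sample_video.wav"

# Extract the audio track to a standalone .wav file at 16 kHz.
VideoFileClip(video_path).audio.write_audiofile(audio_path, fps=16000)

# ImageBind's loader converts the .wav into mel-spectrogram clips for the audio encoder.
audio = load_and_transform_audio_data([audio_path], device="cpu", clips_per_video=8)
print(audio.shape)
```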