Wenhao Wu
Transferring visual statistic knowledge: For the Kinetics-400 experiments, we sample 60 videos per class, roughly 10% of the training data. These videos are fed directly to CLIP's visual encoder to obtain video embeddings. Using these embeddings and their corresponding labels, we fit LDA to obtain the LDA coefficients, which are then used as the classifier. Transferring textual semantic knowledge: we extract text embeddings of the category names with BERT and use them directly as the classifier.
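A minimal sketch of these two steps, assuming OpenAI's `clip` package for the visual encoder, scikit-learn's `LinearDiscriminantAnalysis` for LDA, and Hugging Face `transformers` for BERT; names such as `videos`, `labels`, and `category_names` are illustrative placeholders, not variables from this repository:

```python
import torch
import clip
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# --- Visual statistic knowledge: LDA coefficients as the classifier ---
# `videos` is a list of (T, 3, 224, 224) frame tensors, `labels` the class indices.
video_embeddings = []
with torch.no_grad():
    for frames in videos:
        feat = model.encode_image(frames.to(device))   # (T, D) per-frame features
        video_embeddings.append(feat.mean(dim=0))      # average-pool to one video embedding
X = torch.stack(video_embeddings).float().cpu().numpy()

lda = LinearDiscriminantAnalysis()
lda.fit(X, labels)
visual_classifier_weight = lda.coef_   # (num_classes, D), used as the linear classifier

# --- Textual semantic knowledge: BERT embeddings of category names as the classifier ---
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    tokens = tokenizer(category_names, padding=True, return_tensors="pt")
    text_classifier_weight = bert(**tokens).last_hidden_state[:, 0]   # [CLS] embedding per class
```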
The 4x3 views exist only at test time. Just append the corresponding flags to the command, e.g. --test_crops 3 --test_clips 4; it has nothing to do with num_sample in the config. `sh scripts/run_test.sh configs/k400/k400_train_rgb_vitb-32-f8.yaml exp/k400/ViT-B/32/f8/last_model.pt --test_crops 3 --test_clips 4 `
Thanks for your interest in our work. 1. It is not clear which dataset's results you are referring to. 2. For logit_scale, please refer to the official CLIP code: https://github.com/openai/CLIP/blob/a1d071733d7111c9c014f024669f959182114e33/clip/model.py#L295
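For context, logit_scale in the official CLIP model is a learnable temperature initialised to log(1/0.07), and its exponential scales the cosine-similarity logits at forward time, roughly as sketched here:

```python
import numpy as np
import torch
import torch.nn as nn

# Learnable temperature, initialised to log(1/0.07) as in the official CLIP model
logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

# At forward time the exponential scales the image-text similarity logits, e.g.
# logits_per_image = logit_scale.exp() * image_features @ text_features.t()
```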
> Hello, I admit that this is a good job. However, in the code, you set batch_size=256, but the paper states that it is 128 ( Maybe the version of...
Yes, that's natural. I've already been experimenting with more MLLMs and will release the results soon.
LLaVA-1.6 uses both the base features (336x336 resolution) and additional higher-resolution features. To perform inference similar to LLaVA-1.5, you only need to use the base features, which avoids introducing...
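A purely illustrative sketch of that idea, with hypothetical variable names (not the repository's actual change): keep only the base 336x336 view's features and drop the extra high-resolution tiles so that inference matches the LLaVA-1.5 setting.

```python
# image_features_per_view: (num_views, num_tokens, dim); by assumption, index 0 holds the
# features of the resized 336x336 base view and the remaining entries hold the
# high-resolution tiles. Keeping only the base view mimics LLaVA-1.5-style inference.
base_features = image_features_per_view[0]
visual_tokens = base_features   # feed only these to the language model
```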
I have just updated the code for LLaVA-1.6. Just one line. You can check it out :)
Of course! I'm getting married next week, so I plan to update arXiv with these results in early June after that.
Thank you for your reminder. I have put all the checkpoint links at https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EieZBg9a40VAhSIVl6ovAIIBaCuzYamkfE1dMn6MxjjwGg?e=JcqYk4