VILA
[Feature Request] Evaluation tools for few-shot VQA/Caption
Hi, I'm interested in your great work.
The ./scripts/v1_5/eval/eval_all.sh script is not available right now. Could you release the evaluation tools, especially those for few-shot VQA/Caption?
It would also be great if the MMC4 pretrained weights could be made available.
The dataset_mixture for new_vflan_sharegpt4v_sft is also not available.
Thank you very much!
Thanks for the authors' support.
I found that ./scripts/v1_5/eval/eval_all.sh is now available.
However, the evaluation tools for few-shot VQA/Caption are also essential for researchers following this work. Looking forward to the release of this part.
Thank you very much!
Hi Qingyun,
Which evaluation scripts are you looking for, for VQA and caption? The current eval_all.sh should cover all the metrics in the paper.
@Lyken17
Thanks for your reply! I'm looking for the few-shot OKVQA/TextVQA/CocoCaption/FlickrCaption evaluations from the ablation studies in Tables 1/3. :pray::pray: Best regards.
@Lyken17 I'm writing to request the evaluation tools for few-shot VQA/Caption (specifically, 4-shot OKVQA/TextVQA/CocoCaption/FlickrCaption in the ablation studies of VILA Tables 1/3).
The experimental results validate that, when used for pre-training LLaVA-like MLLMs, interleaved image-text data (MMC4) achieves better few-shot VQA/Caption results than image-text pair data (COYO/LAION...). I tried to evaluate the few-shot VQA scores of the open-source VILA-7B weights, but I did not reach the same conclusion:
| Benchmark | 0-shot | 1-shot | 2-shot | 4-shot |
| --- | --- | --- | --- | --- |
| OKVQA | 61.05 | 56.93 | 56.84 | 56.47 |
| TextVQA | 62.64 | 60.73 | 60.45 | 60.88 |
I realize that my implementation may not be a valid way to measure few-shot performance, so I hope you will consider releasing the evaluation tool, since you seem to be the main contributor to this open-source repository. It would be a great help to my research and I would be very grateful.
Best regards, Qingyun.
Details of my implementation have been sent to your email.
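For context, here is a minimal public sketch of the kind of k-shot prompt construction I used. It is only illustrative: the helper name build_fewshot_prompt, the `<image>` placeholder token, and the "Question / Short answer" template are my own assumptions, not the official VILA evaluation code.

```python
# Illustrative sketch of k-shot in-context prompt construction for OKVQA-style eval.
# The helper name, the "<image>" placeholder, and the template are assumptions,
# not the official VILA evaluation pipeline.
import random


def build_fewshot_prompt(support_set, query_image_token, query_question, k=4, seed=0):
    """Build an in-context prompt with k (image, question, answer) exemplars,
    followed by the query question left unanswered for the model to complete."""
    rng = random.Random(seed)
    shots = rng.sample(support_set, k) if k > 0 else []
    segments = []
    for image_token, question, answer in shots:
        segments.append(f"{image_token}\nQuestion: {question}\nShort answer: {answer}")
    # The query example has no answer; the model is expected to generate it.
    segments.append(f"{query_image_token}\nQuestion: {query_question}\nShort answer:")
    return "\n\n".join(segments)


# Example usage with placeholder data; "<image>" stands in for the model's image token.
support = [
    ("<image>", "What sport is being played?", "baseball"),
    ("<image>", "What color is the bus?", "red"),
    ("<image>", "What is the man holding?", "umbrella"),
    ("<image>", "What animal is shown?", "giraffe"),
    ("<image>", "Where was this photo taken?", "beach"),
]
print(build_fewshot_prompt(support, "<image>", "What is on the table?", k=4))
```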
cc @kentang-mit and @Seerkfang, who are more familiar with the evaluation scripts.
@Lyken17 Okkk, thanks for your reply!
Dear @kentang-mit and @Seerkfang:
Could you please share the few-shot evaluation scripts?
It will be a great help to my research and I will be very grateful to you.
In the few-shot VQA/Caption results of the VILA paper, the improvement from interleaved image-text pre-training, in contrast to the degradation seen with image-text pair pre-training, is an essential reason for VILA to add stage 2. Stage 2 seems to give the SFT model better few-shot learning performance, which can also serve as a rebuttal to the point raised in #12.