
[Feature Request] Evaluation tools of the Few-shot VQA/Caption

Open Li-Qingyun opened this issue 1 year ago • 6 comments

Hi, I'm interested in your great work.

The script ./scripts/v1_5/eval/eval_all.sh is not available yet. Could you release the evaluation tools, especially those for few-shot VQA/Caption?

It would also be great if the mmc4 pre-trained weights could be made available.

The dataset_mixture for new_vflan_sharegpt4v_sft is also not available.

Thank you very much!

Li-Qingyun avatar Mar 06 '24 11:03 Li-Qingyun

Thanks for the authors' support. I found that ./scripts/v1_5/eval/eval_all.sh is now available.

The evaluation tools for few-shot VQA/Caption are also essential for researchers following this work. Looking forward to the release of this part.

Thank you very much!

Li-Qingyun avatar Mar 07 '24 00:03 Li-Qingyun

Hi Qingyun,

Which VQA and caption evaluation scripts are you looking for? The current eval_all.sh should cover all metrics in the paper.

Lyken17 avatar Mar 07 '24 08:03 Lyken17

> Hi Qingyun,
>
> Which VQA and caption evaluation scripts are you looking for? The current eval_all.sh should cover all metrics in the paper.

@Lyken17

Thanks for your reply! I'm looking for the few-shot OKVQA/TextVQA/CocoCaption/FlickrCaption evaluations used in the ablation studies of Table 1/3. :pray::pray: Best regards.

Li-Qingyun avatar Mar 07 '24 11:03 Li-Qingyun

@Lyken17 I'm writing to request the evaluation tools for few-shot VQA/Caption (specifically, 4-shot OKVQA/TextVQA/CocoCaption/FlickrCaption from the ablation studies in VILA Table 1/3).

The experimental results showed that, when used for pre-training LLaVA-like MLLMs, interleaved image-text data (MMC4) achieves better few-shot VQA/Caption results than image-text pair data (COYO/LAION, ...). I tried to evaluate the few-shot VQA scores of the open-source VILA-7B weights, but I did not reach the same conclusion:

| Benchmark | 0-shot | 1-shot | 2-shot | 4-shot |
| --- | --- | --- | --- | --- |
| OKVQA | 61.05 | 56.93 | 56.84 | 56.47 |
| TextVQA | 62.64 | 60.73 | 60.45 | 60.88 |
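For context, OKVQA/TextVQA are conventionally scored with the VQA soft-accuracy metric. Below is a simplified sketch of that metric; it omits the official answer normalization and the averaging over annotator subsets, so it may differ slightly from the reference scorer.

```python
# Simplified VQA soft accuracy: an answer scores min(#matching annotators / 3, 1).
# The official scorer also normalizes answers (articles, punctuation, number words)
# and averages over 9-annotator subsets; those steps are omitted in this sketch.
from typing import List


def vqa_soft_accuracy(prediction: str, gt_answers: List[str]) -> float:
    pred = prediction.strip().lower()
    matches = sum(1 for ans in gt_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    # Ten human answers per question, as in the VQA-style annotation format.
    gts = ["dog"] * 6 + ["puppy"] * 3 + ["cat"]
    print(vqa_soft_accuracy("dog", gts))    # 1.0
    print(vqa_soft_accuracy("puppy", gts))  # 1.0
    print(vqa_soft_accuracy("cat", gts))    # 0.333...
```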

I realize that my implementation may not be valid for measuring few-shot performance, so I hope you will consider releasing the evaluation tool, since you seem to have become the major contributor to this open-source repository. It would be a great help to my research, and I would be very grateful.
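For reference, here is a minimal sketch of the kind of few-shot prompt construction I have in mind for a LLaVA/VILA-style model. The image placeholder token, separators, and short-answer format are assumptions for illustration, not necessarily what the official scripts (or my own implementation) use.

```python
# Minimal sketch of few-shot (in-context) VQA prompt construction.
# All formatting details (image token, separators, answer style) are assumptions;
# the official VILA evaluation scripts may format exemplars differently.
from dataclasses import dataclass
from typing import List

IMAGE_TOKEN = "<image>"  # placeholder assumed for the vision input


@dataclass
class VQAExample:
    image_path: str
    question: str
    answer: str  # ground-truth short answer, used only for in-context exemplars


def build_few_shot_prompt(shots: List[VQAExample], query: VQAExample) -> str:
    """Concatenate solved exemplars followed by the query question.

    Each exemplar contributes its image token, question, and short answer;
    the query contributes only image + question, leaving the answer to the model.
    """
    segments = []
    for ex in shots:
        segments.append(
            f"{IMAGE_TOKEN}\nQuestion: {ex.question}\nShort answer: {ex.answer}"
        )
    segments.append(f"{IMAGE_TOKEN}\nQuestion: {query.question}\nShort answer:")
    return "\n\n".join(segments)


if __name__ == "__main__":
    shots = [
        VQAExample("img_001.jpg", "What animal is shown?", "dog"),
        VQAExample("img_002.jpg", "What color is the bus?", "red"),
    ]
    query = VQAExample("img_003.jpg", "How many people are visible?", "")
    print(build_few_shot_prompt(shots, query))  # 2-shot prompt for the query
```

In my experience, how the in-context exemplars are formatted and separated can noticeably change few-shot scores, which is why the exact official setup matters.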

Best regards, Qingyun.

Details of my implementation have been sent to your email.

Li-Qingyun avatar Mar 16 '24 12:03 Li-Qingyun

cc @kentang-mit and @Seerkfang, who are more familiar with the evaluation scripts.

Lyken17 avatar Mar 21 '24 16:03 Lyken17

> cc @kentang-mit and @Seerkfang, who are more familiar with the evaluation scripts.

@Lyken17 Okkk, thanks for your reply!

Dear @kentang-mit and @Seerkfang:

Could you please share the few-shot evaluation scripts?

It will be a great help to my research and I will be very grateful to you.

In the few-shot VQA/Caption results of the VILA paper, the improvement from interleaved image-text pre-training, in contrast to the decline seen with image-text pair pre-training, is an essential reason for VILA's stage 2. Stage 2 appears to give the SFT model better few-shot learning performance, which can also serve as a rebuttal to the point raised in #12.

Li-Qingyun avatar Mar 22 '24 00:03 Li-Qingyun