VILA: Multi-image input is worse than concatenating the images into a single image.

Open liuweijie19980216 opened this issue 1 year ago • 2 comments

Hello, I have tried your code and pretrained models; it is excellent work.

However, I have run into an issue with the multi-image task.

Each of my input images has a width:height ratio of 1:2. Concatenating two such images side by side gives a single image with a width:height ratio of 1:1. With this concatenated input, the accuracy is 95%.

When I instead feed the two images (width:height = 1:2) to VILA as separate inputs, the accuracy drops dramatically to 85%.

In both setups, I fine-tuned the pretrained model on my data.

Mar 22 '24 06:03 liuweijie19980216

This is an interesting experiment. Are you only running inference with the pretrained weights, or have you fine-tuned VILA with each of the two strategies? Can you provide an example?

Mar 25 '24 00:03 Li-Qingyun

This depends on the downstream benchmark. VILA crops and resizes each image to 336x336, so concatenated images will lead to a performance drop when detailed information is needed (e.g., OCR).

Could you share your evaluation settings and benchmark images?
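
To make the resolution argument concrete, here is a minimal sketch of how many pixels each source image keeps under the two input strategies. It assumes a plain resize to 336x336 that ignores aspect ratio; VILA's actual preprocessing may crop or pad instead. The function name `pixels_per_source_image` and the 400x800 source size are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

TARGET = 336  # side length of the square model input mentioned above


def pixels_per_source_image(width, height, n_concat=1, target=TARGET):
    """Return the (w, h) pixel footprint of one source image after
    n_concat images are placed side by side and the canvas is resized
    to target x target (assumption: aspect ratio is not preserved)."""
    canvas_w = width * n_concat  # side-by-side concatenation widens the canvas
    canvas_h = height
    scale_x = Fraction(target, canvas_w)
    scale_y = Fraction(target, canvas_h)
    return (int(width * scale_x), int(height * scale_y))


# Hypothetical source size with width:height = 1:2, as in the report.
w, h = 400, 800

separate = pixels_per_source_image(w, h, n_concat=1)
concatenated = pixels_per_source_image(w, h, n_concat=2)

print(separate)      # (336, 336)
print(concatenated)  # (168, 336)
```

Under this assumption, each image in the concatenated input keeps half the horizontal resolution it would have as a separate input. This only illustrates the resolution trade-off; it does not by itself account for the accuracy numbers reported above.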

Mar 27 '24 23:03 Lyken17

This issue has been inactive for a while. Feel free to reopen it if the problem still exists.

Feb 25 '25 09:02 Lyken17
