VILA
Multi-image input performs worse than concatenating the images into a single image.
Hello, I have tried your code and pretrained models; it is excellent work.
However, I ran into an issue with a multi-image task.
Each of my input images has a width:height ratio of 1:2. Concatenating two of them gives a single image with a width:height ratio of 1:1, and with this input the accuracy is 95%.
When I instead pass the two images (each width:height = 1:2) to VILA separately, the accuracy drops dramatically to 85%.
In both cases, I fine-tuned the pretrained model on my data.
This is an interesting experiment. Are you only running inference with the pretrained weights, or have you fine-tuned VILA with each of the two strategies? Can you provide an example?
This depends on the downstream benchmark. VILA clips and resizes each image to 336x336, so concatenated images will lead to a performance drop when detailed information is needed (e.g., OCR).
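To make the resolution argument concrete, here is a rough sketch of the arithmetic. It assumes a plain resize to 336x336 with no cropping or padding, and side-by-side concatenation; the actual VILA preprocessing may differ. The function name and the 336x672 example size are hypothetical, chosen only to match the 1:2 ratio described above.

```python
def footprint_after_resize(width, height, n_side_by_side=1, target=336):
    """Estimate how one source image ends up after preprocessing.

    Assumption: n_side_by_side copies of a (width x height) image are
    concatenated horizontally, then the canvas is plainly resized to
    target x target (no crop, no padding).

    Returns (pixels_wide, pixels_high, stretch), where stretch is the
    horizontal scale divided by the vertical scale (1.0 = aspect kept).
    """
    canvas_w = width * n_side_by_side
    scale_x = target / canvas_w
    scale_y = target / height
    return width * scale_x, height * scale_y, scale_x / scale_y


# A hypothetical 336x672 image (width:height = 1:2).
separate = footprint_after_resize(336, 672)
concatenated = footprint_after_resize(336, 672, n_side_by_side=2)

print(separate)      # (336.0, 336.0, 2.0)
print(concatenated)  # (168.0, 336.0, 1.0)
```

Under these assumptions, each image in the concatenated input gets only half as many horizontal pixels, which is where fine detail such as text could be lost. On the other hand, the separately fed 1:2 image is stretched horizontally by a factor of 2, while the concatenated 1:1 canvas keeps its aspect ratio, so which strategy wins may depend on the task.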
Could you share your evaluation settings and benchmark images?
This issue has been inactive for a while. Feel free to reopen it if the problem still exists.