
Use vila-infer to reason among multiple images

Open Hetznero opened this issue 11 months ago • 1 comment

I saw in a previous issue that VILA is able to reason across multiple images (see: https://github.com/NVlabs/VILA/issues/20).

I wanted to try this with vila-infer as well. However, when I use the following input: --text " Is a image of a man with tattoos, Is a image of a landscape, Is"

I get "1" as output, along with the following warning: Media token '' found in text: ' Is a image of a man with tattoos, Is a image of a landscape, Is'. Removed.

So I was wondering whether vila-infer is able to reason across multiple images and, if so, how I need to change the text.

Hetznero, Dec 20 '24 11:12

Can you attach a failing example?

Lyken17, Feb 28 '25 00:02