VILA
Use vila-infer to reason among multiple images
I saw in a previous issue that it is possible to reason among multiple images (see: https://github.com/NVlabs/VILA/issues/20).
I wanted to try this with vila-infer as well. However, if I use the following input:
--text "
I get the following warning, and the output is "1":
Media token '
So I was wondering whether vila-infer is able to reason among multiple images and, if so, how I need to change the text.
Can you attach a failed example?