LLaVA-NeXT

Question regarding using gridded anyres images for interleave inference

Open LuciusLan opened this issue 1 year ago • 1 comments

First of all, thank you for open-sourcing this great work!

I notice that in the released demo code, although image_aspect_ratio is set to anyres, each image is processed as a single image, resized and padded to 384x384 with the default preprocess method, rather than going through process_image or process_anyres_image in mm_utils.py. In your recently released paper, the multi-patch setting also seems to apply only to single-image tasks. Is interleaved inference on higher-resolution images with the grid setting supported? Or would using the grid-sliced anyres patches raise performance concerns? (Intuitively, feeding image-token sequences several thousand tokens long for multiple images makes me think of the infamous "lost in the middle" issue.)
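To put rough numbers on the token-length concern, here is a back-of-the-envelope sketch. It is not the logic of process_anyres_image in mm_utils.py; grid_for and estimate_tokens are hypothetical helpers, and the 729 tokens per 384x384 crop and the 2x2 grid cap are assumptions for illustration only. Any pooling or unpadding the real pipeline applies would lower the gridded figure.

```python
from math import ceil

PATCH_SIZE = 384        # crop side length, as in the 384x384 resize described above
TOKENS_PER_CROP = 729   # ASSUMPTION: 27 x 27 tokens per 384x384 crop
MAX_GRID = (2, 2)       # ASSUMPTION: at most 2 x 2 crops per image


def grid_for(width, height, max_grid=MAX_GRID):
    """Hypothetical helper: crops per axis needed to cover the image, capped at max_grid."""
    cols = min(max_grid[0], max(1, ceil(width / PATCH_SIZE)))
    rows = min(max_grid[1], max(1, ceil(height / PATCH_SIZE)))
    return cols, rows


def estimate_tokens(sizes, use_grid):
    """Estimate total image tokens for a list of (width, height) images."""
    total = 0
    for width, height in sizes:
        crops = 1  # one resized overview of the whole image
        if use_grid:
            cols, rows = grid_for(width, height)
            crops += cols * rows  # plus the grid-sliced crops
        total += crops * TOKENS_PER_CROP
    return total


images = [(1024, 768)] * 4  # four interleaved images
single = estimate_tokens(images, use_grid=False)
gridded = estimate_tokens(images, use_grid=True)
print(f"single-image path: {single} tokens, gridded path: {gridded} tokens")
```

Under these assumptions, four 1024x768 images cost five times as many image tokens on the gridded path as on the single-image path, which is the scale at which I would start worrying about long-context degradation.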

LuciusLan, Aug 01 '24 11:08