LLaVA-NeXT Question regarding using gridded anyres images for interleave inference

Open LuciusLan opened this issue 1 year ago • 1 comment

First of all, thank you for open-sourcing this great work!

I notice that in the released demo code, although image_aspect_ratio is set to anyres, each image is processed as a single image, resized and padded to 384x384 with the default preprocess method, rather than with process_image or process_anyres_image from mm_utils.py. In your recently released paper, the multi-patch setting also seems to apply only to single-image tasks. Is interleave inference on higher-resolution images with the grid setting supported? Or is there a performance concern with using grid-sliced anyres patches? (Intuitively, feeding image-token sequences several thousand tokens long for multiple images makes me think of the infamous "Lost in the middle" issue.)
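To make the token-count concern concrete, here is a rough back-of-the-envelope sketch. It is not the repository's code: `estimate_image_tokens` is a hypothetical helper, the patch size of 14 is an assumption about the vision encoder, and the estimate ignores any pooling, unpadding, or the grid-selection logic that the real anyres path may apply.

```python
import math


def estimate_image_tokens(width, height, tile=384, patch=14, add_base=True):
    """Rough token count for one image split into a grid of tile x tile crops.

    Assumptions (not taken from the LLaVA-NeXT code):
      - each crop yields (tile // patch) ** 2 visual tokens,
      - the grid is the smallest one that covers the image,
      - an extra downsampled "base" view is added when add_base is True.
    """
    tokens_per_tile = (tile // patch) ** 2
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    n_tiles = cols * rows + (1 if add_base else 0)
    return n_tiles * tokens_per_tile


# One image resized and padded to 384x384, as in the demo code.
single = estimate_image_tokens(384, 384, add_base=False)

# The same image handled as a 2x2 grid of crops plus a base view.
gridded = estimate_image_tokens(768, 768)

# Four gridded images in one interleaved prompt.
print(single, gridded, 4 * gridded)
```

Under these assumptions a single resized image costs a few hundred tokens, while a handful of gridded images already reaches the tens of thousands, which is the regime where the long-context concern above would apply.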

Aug 01 '24 11:08 LuciusLan
