LLaVA-NeXT
Question regarding using gridded anyres images for interleave inference
First of all, thank you for open-sourcing this great work!
I noticed that in the released demo code, although image_aspect_ratio is set to anyres, images are processed as a single image resized and padded to 384x384 via the default preprocess method, rather than through process_image or process_anyres_image in mm_utils.py. In your recently released paper, the multi-patch setting also appears to apply only to single-image tasks.
I would like to know whether interleave inference on higher-resolution images with the grid setting is supported, or whether there is a performance concern with using the grid-sliced anyres patches. (Intuitively, feeding several thousand image tokens per image, across multiple images, brings to mind the infamous "lost in the middle" issue.)
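To make the concern concrete, here is a rough back-of-the-envelope sketch of how token counts grow under a grid-sliced anyres scheme. The patch size (384) and tokens-per-tile (729, i.e. a 27x27 feature grid) are assumptions for illustration, not necessarily the exact values or logic in process_anyres_image:

```python
import math

def anyres_token_estimate(width, height, patch_size=384, tokens_per_tile=729):
    """Rough estimate of image tokens for an anyres-style grid:
    the image is tiled into ceil(w/p) x ceil(h/p) patches, plus one
    downscaled overview of the whole image. Illustrative only."""
    grid_w = math.ceil(width / patch_size)
    grid_h = math.ceil(height / patch_size)
    num_tiles = grid_w * grid_h + 1  # +1 for the resized base image
    return num_tiles * tokens_per_tile

# A single 1024x768 image: 3x2 tiles + 1 base = 7 tiles -> 5103 tokens
print(anyres_token_estimate(1024, 768))
```

With, say, four such images interleaved in one prompt, that is already ~20k image tokens, which is why I suspect the lost-in-the-middle effect could matter here.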