LLaVA-NeXT
The llava-onevision model's video inference code has an error
For the llava-onevision model, the official video inference code never overrides the image_aspect_ratio parameter, so it falls back to the default anyres_max_9. As a result, the image_features consume a huge amount of GPU memory during inference. Is this intended? The paper states that each video frame is represented by 196 tokens, but with anyres_max_9 the number of tokens per frame far exceeds 196. Relevant links (a workaround sketch follows them):
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb
https://github.com/LLaVA-VL/LLaVA-NeXT/issues/142
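To make the report concrete, here is a minimal sketch of what I mean by overriding the parameter, assuming the model is loaded via load_pretrained_model as in the tutorial notebook; "pad" is only my guessed override value, and whether it (or some other setting) is the correct one for video is exactly what this issue asks:

```python
# Minimal workaround sketch. Assumptions: model loading follows the
# official tutorial notebook, and "pad" is a guessed override value --
# the correct setting for video inference is the open question here.
from llava.model.builder import load_pretrained_model

pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, "llava_qwen", device_map="auto"
)

# The tutorial leaves this at its default, anyres_max_9, which tiles
# each frame into multiple crops and blows up the per-frame token count.
print(model.config.image_aspect_ratio)  # -> "anyres_max_9"

# Guessed override so each frame is encoded as a single base-resolution
# image instead of an anyres tile grid:
model.config.image_aspect_ratio = "pad"
```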
Additionally, I cannot find the logic in the code that maps each frame to 196 tokens. Where is it implemented?
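For reference, my reading of the paper's 196-token figure: the SigLIP encoder produces a 27×27 = 729-token grid per frame, and bilinearly interpolating that grid down to 14×14 gives 196 tokens. Below is a self-contained sketch of that arithmetic (my own reconstruction, not code from this repo; the 1152 hidden size is SigLIP-SO400M's):

```python
# Reconstruction of the per-frame 729 -> 196 token reduction described
# in the paper (assumed mechanism: bilinear interpolation of the grid).
import torch
import torch.nn.functional as F

frame_features = torch.randn(1, 729, 1152)  # (batch, 27*27 tokens, hidden)

# Reshape the token sequence back into its 2D spatial grid.
grid = frame_features.transpose(1, 2).reshape(1, 1152, 27, 27)

# Bilinearly downsample 27x27 -> 14x14, i.e. 196 tokens per frame.
pooled = F.interpolate(grid, size=(14, 14), mode="bilinear")

# Flatten back to a token sequence: (1, 196, 1152).
frame_tokens = pooled.flatten(2).transpose(1, 2)
print(frame_tokens.shape)  # torch.Size([1, 196, 1152])
```

If the repo implements this reduction somewhere, a pointer to the exact function would answer my question.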