LLaVA-NeXT
The llava-onevision model's video inference code has an error
For the llava-onevision model, the official video inference code never overrides the image_aspect_ratio parameter, so it falls back to the default anyres_max_9. As a result, the image_features consume a huge amount of GPU memory during inference. Is this intended? The paper states that each video frame is represented by 196 tokens, but with anyres_max_9 the number of tokens per frame far exceeds 196. Relevant links (a workaround sketch follows them):
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb
https://github.com/LLaVA-VL/LLaVA-NeXT/issues/142
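To make the report concrete, here is a minimal sketch of what I mean by overriding the parameter, assuming the model is loaded via load_pretrained_model as in the tutorial notebook; "pad" is only my guessed override value, and whether it (or some other setting) is the correct one for video is exactly what this issue asks:

```python
# Minimal workaround sketch. Assumptions: model loading follows the
# official tutorial notebook, and "pad" is a guessed override value --
# the correct setting for video inference is the open question here.
from llava.model.builder import load_pretrained_model

pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, "llava_qwen", device_map="auto"
)

# The tutorial leaves this at its default, anyres_max_9, which tiles
# each frame into multiple crops and blows up the per-frame token count.
print(model.config.image_aspect_ratio)  # -> "anyres_max_9"

# Guessed override so each frame is encoded as a single base-resolution
# image instead of an anyres tile grid:
model.config.image_aspect_ratio = "pad"
```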
Additionally, I cannot find the logic in the code that maps each frame to 196 tokens. Where is it implemented?
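For reference, my reading of the paper's 196-token figure: the SigLIP encoder produces a 27×27 = 729-token grid per frame, and bilinearly interpolating that grid down to 14×14 gives 196 tokens. Below is a self-contained sketch of that arithmetic (my own reconstruction, not code from this repo; the 1152 hidden size is SigLIP-SO400M's):

```python
# Reconstruction of the per-frame 729 -> 196 token reduction described
# in the paper (assumed mechanism: bilinear interpolation of the grid).
import torch
import torch.nn.functional as F

frame_features = torch.randn(1, 729, 1152)  # (batch, 27*27 tokens, hidden)

# Reshape the token sequence back into its 2D spatial grid.
grid = frame_features.transpose(1, 2).reshape(1, 1152, 27, 27)

# Bilinearly downsample 27x27 -> 14x14, i.e. 196 tokens per frame.
pooled = F.interpolate(grid, size=(14, 14), mode="bilinear")

# Flatten back to a token sequence: (1, 196, 1152).
frame_tokens = pooled.flatten(2).transpose(1, 2)
print(frame_tokens.shape)  # torch.Size([1, 196, 1152])
```

If the repo implements this reduction somewhere, a pointer to the exact function would answer my question.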