Question on Multi-Image Input Processing During Training
I encountered some confusion while reading the code that handles multi-image inputs, particularly in the following section: https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L127
The nested for loops starting at
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L168
and
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L198
seem to iterate over image_features[cur_image_idx]. This indexing suggests that the size of the first dimension of image_features (or its length, if it is a list) should equal batch_size * num_images. It therefore looks like the flatten operation should merge those two dimensions, rather than merging num_images with the subsequent token/channel dimension, which makes me doubt my understanding of the process. Could you clarify where my confusion may lie? I would also appreciate more insight into the expected layout of multi-image inputs and how that layout is manipulated in the code, specifically in
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L122-L127
Thank you very much for your assistance.