Question on Multi-Image Input Processing During Training
I encountered some confusion while reading the code that handles multi-image inputs, particularly in the following section: https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L127
The nested for loops starting at
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L168
and
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L198
seem to iterate over image_features[cur_image_idx]. This indexing suggests that the size of the first dimension of image_features (or its length, if it is a list) should equal batch_size * num_images. It therefore looks like the flatten operation should merge those two dimensions, rather than merging num_images with the subsequent token/channel dimension, which makes me doubt my understanding of the process. Could you clarify where my confusion may lie? I would also appreciate more insight into the expected layout of multi-image inputs and how that layout is manipulated in the code, specifically in
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L122-L127
Thank you very much for your assistance.