Does CogVLM2 support SFT for multi-image QA within a single sample?
Feature request / 功能建议
Hi, CogVLM2 team.
Thank you for your brilliant work and this neat, easy-to-follow codebase. I read through the repo this morning and have a related question. My challenge is to feed 6 images simultaneously in a single QA turn and ask questions that require retrieving information from all six images (e.g., "What are the important objects in these six images, and why?").

To achieve this, it seems I only need to prepare the dataset, modify these lines (https://github.com/THUDM/CogVLM2/blob/2af46662b2cad1ba0acf743a42e3b61437e2c6df/finetune_demo/peft_lora.py#L71-L93), feed the images as a Python list, and choose a suitable max_input_tokens, as sketched below. Since I don't yet know enough about CogVLM2's vision encoder, I'd like to confirm: is my understanding correct? Also, if max_input_tokens is fixed to 8192, can I feed 6 images simultaneously? Thank you again for your great work, and I look forward to your reply.
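For concreteness, here is a rough, untested sketch of what I have in mind. I'm assuming the same build_conversation_input_ids helper used in the single-image demos would accept a list of images, which is exactly the part I'm unsure about; the file names are hypothetical.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Hypothetical paths: six images belonging to one QA sample.
image_paths = [f"scene_{i}.jpg" for i in range(6)]
images = [Image.open(p).convert("RGB") for p in image_paths]

# The key question: can `images` be a list of more than one image here?
inputs = model.build_conversation_input_ids(
    tokenizer,
    query="What are the important objects in these six images, and why?",
    images=images,  # a Python list of 6 images instead of 1
    template_version="chat",
)
```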
Best regards, Xuefen
Motivation / 动机
Support multi-image SFT
Your contribution / 您的贡献
None
Any feedback, please?
This model can only process one image per turn. If you pass multiple images, each subsequent image will overwrite the previous one. This is a limitation of the model structure. The max_input_tokens parameter applies to the language model's text context length; increasing it does not allow more images.
As a hack, you can try "merging" several images into 1 image, but you'd probably have to finetune the model a bit.
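For example, a minimal sketch of such a merge with Pillow (the 3x2 layout and tile size are arbitrary choices; pick whatever resolution your finetuning pipeline expects):

```python
from PIL import Image

def merge_images(paths, cols=3, rows=2, tile=(448, 448)):
    # Tile the images onto one canvas so the model sees a single composite image.
    canvas = Image.new("RGB", (cols * tile[0], rows * tile[1]))
    for idx, path in enumerate(paths):
        img = Image.open(path).convert("RGB").resize(tile)
        x, y = (idx % cols) * tile[0], (idx // cols) * tile[1]
        canvas.paste(img, (x, y))
    return canvas

# Hypothetical file names; feed the merged result to CogVLM2 as one image.
merged = merge_images([f"scene_{i}.jpg" for i in range(6)])
merged.save("merged.jpg")
```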