Can we use in-context multimodal data for finetuning?

Open waltonfuture opened this issue 1 year ago • 6 comments

Thanks for your great work! However, it seems that we can only use data that contains one image for SFT. Can we use in-context multimodal data (i.e., containing multiple images) for finetuning?

Jun 06 '24 19:06 waltonfuture

Yes, the code supports multi-image finetuning.

Jun 07 '24 13:06 qyc-98

> yes, the code supports multi-image finetuning

Thank you. How should I organize my data for multi-image SFT? And how do I run inference with multiple images?

Jun 07 '24 14:06 waltonfuture

Same problem here. Any update on multi-image sft?

Jun 11 '24 02:06 haochuan-li

@qyc-98 Hello! Can you provide some simple examples of in-context inference or SFT? Thanks a lot!

Jun 13 '24 09:06 waltonfuture

@qyc-98 I have encountered the same problem. Have you resolved it?

Jun 14 '24 06:06 1SingleFeng

+1 also curious about this

Jun 24 '24 20:06 pbarker