About choosing dataset format and pre-training weights

Open xmc-andy opened this issue 1 year ago • 2 comments

Hello, authors! I have a question about choosing a dataset format and the corresponding weights. I am working on a classification task whose input is multiple images plus a prompt. If the images are treated as a video, there are two options: SD mode (a single <image> + a single <Users>, where <image> represents all images) and DC mode (a single <image> + multiple <Users>). As I understand it, they differ in how the prompt is used: DC mode suits cases where each image has its own detailed prompt, while SD mode suits cases where all images share one prompt. Is my understanding correct?

In addition, I previously used the Image-MPT7B weights in SD mode, but it seems the Video-LLaMA7B-DenseCaption weights, in either DC or SD mode, are better suited to video-frame input. Is that right?

xmc-andy, Nov 13 '23 07:11

Yes, that's largely correct! I suggest using DC mode with the video-pretrained weights. As you can see in our web demo, the backend model is Video-LLaMA7B-DC.

Remember to put the multiple images as frames in the F dimension of the [B, T, F, C, H, W] tensor (debug at vision_x to see the actual dimensions during your training). I also suggest trying both templates:

1. <image> + prompt
2. <image><image>...<image> + prompt

For training DC, we use the first.
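
A minimal sketch of the frame packing and the two templates, for illustration only. It uses NumPy arrays in place of training tensors, and the helper names `pack_frames` and `build_prompt`, the 224x224 image size, and the direct concatenation of `<image>` with the prompt are assumptions, not the repository's actual code. Check the shape of vision_x in your own run.

```python
import numpy as np


def pack_frames(images):
    """Stack N images of shape (C, H, W) along a new frame axis.

    Returns an array shaped (B=1, T=1, F=N, C, H, W), so the images
    occupy the F dimension. (Hypothetical helper, not from the repo.)
    """
    frames = np.stack(images, axis=0)            # (F, C, H, W)
    return frames[np.newaxis, np.newaxis, ...]   # (1, 1, F, C, H, W)


def build_prompt(prompt, num_images, repeat_image_token=False):
    """Build one of the two templates discussed above.

    repeat_image_token=False -> "<image>" + prompt          (template 1)
    repeat_image_token=True  -> "<image>" * N + prompt      (template 2)
    The exact separators and role tags are an assumption here.
    """
    count = num_images if repeat_image_token else 1
    return "<image>" * count + prompt


# Four dummy RGB frames; 224x224 is an assumed resolution.
images = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(4)]
vision_x = pack_frames(images)
print(vision_x.shape)

template_1 = build_prompt("Classify these frames.", len(images))
template_2 = build_prompt("Classify these frames.", len(images), repeat_image_token=True)
print(template_1)
print(template_2)
```

With either template the image array is the same; only the number of `<image>` tokens in the text changes, which makes it easy to compare the two.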

Luodian, Nov 13 '23 07:11

Thank you so much!

xmc-andy, Nov 13 '23 07:11