
[train] Single- or multi-round multi-image training

Open codybum opened this issue 1 year ago • 7 comments

I was very pleased to run across this very impressive project; thanks for the contribution.

In some domains, such as pathology and radiology, it can take more than one image/resolution to describe a region of interest due to image size or count. In this case we don't need to compare images, but rather allow several images to represent one thing. This would be similar in concept to MIL visual modeling (https://github.com/Project-MONAI/tutorials/tree/main/pathology/multiple_instance_learning).

I have run across several posts [1-2] discussing multi-image conversations, but I could not find any information on how a model might be trained with multiple images. A multi-round solution might work, but from a training perspective I would like to explore training with multiple images for a single response. With larger context sizes, including 15-20 images along with a narrative report would be possible.

Any help exploring this topic would be appreciated.

[1] https://github.com/Luodian/Otter/pull/150 [2] https://github.com/Luodian/Otter/issues/89

codybum avatar Aug 04 '23 12:08 codybum

Thank you for your interest.

Accomplishing a multi-image response with a single instruction can be easily done by adhering to the dataset format found here: https://github.com/Luodian/Otter/blob/9b34a4467581869c67dae7ea2b970f8e6b201d3c/pipeline/mimicit_utils/mimicit_dataset.py#L432

To achieve this, you may follow these steps:

  1. Format your data following the guidelines provided here: https://github.com/Luodian/Otter/tree/main/mimic-it (a minimal data-prep sketch follows these steps). Assume the prefix of your instruction id is "MED", like so:
"MED_INS_00001": {
    "instruction": "XXX",
    "answer": "XXX.",
    "image_ids": ["XXX", "..."],  # The multiple images corresponding to this instruction
    "rel_ins_ids": []  # This can be []. For a multi-round conversation, fill it with the instruction ids of the other rounds.
},
  2. Modify this line from:
elif cur_train_id.startswith("SD"): 

to:

elif cur_train_id.startswith("SD") or cur_train_id.startswith("MED"): 

This is because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.

  3. Begin tuning your data with Otter by altering your specific instruction/image/train configuration from:
--mimicit_path="path/to/DC_instruction.json" \
--images_path="path/to/DC.json" \
--train_config_path="path/to/DC_train.json" \

to:

--mimicit_vt_path="path/to/MED_instruction.json" \
--images_vt_path="path/to/MED.json" \
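
To make step 1 concrete, here is a minimal Python sketch that writes the two files the flags above point to. It follows the instruction/images layout described in the mimic-it README as I understand it; the MED_IMG_* id scheme, the meta block, the folder path, and the base64 encoding of images are my assumptions, so verify them against your checkout before training.

# Minimal sketch (not Otter repo code): build MED_instruction.json and MED.json.
# Assumptions: images are stored as base64 strings keyed by id, and the
# instruction file wraps entries in a {"meta": ..., "data": ...} object.
import base64
import json
from pathlib import Path

image_dir = Path("path/to/med_images")  # hypothetical folder of case images

images = {}      # image_id -> base64-encoded image bytes
image_ids = []
for i, img_path in enumerate(sorted(image_dir.glob("*.png")), start=1):
    image_id = f"MED_IMG_{i:05d}"  # hypothetical id scheme
    images[image_id] = base64.b64encode(img_path.read_bytes()).decode("utf-8")
    image_ids.append(image_id)

instructions = {
    "meta": {"version": "0.0.1", "time": "2023-08", "author": "example"},
    "data": {
        "MED_INS_00001": {
            "instruction": "Describe the region of interest across these images.",
            "answer": "The narrative report text goes here.",
            "image_ids": image_ids,  # all images feeding this single response
            "rel_ins_ids": [],       # empty: single-round instruction
        }
    },
}

Path("MED_instruction.json").write_text(json.dumps(instructions))
Path("MED.json").write_text(json.dumps(images))

With those two files in place, point --mimicit_vt_path and --images_vt_path at them as shown above.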

If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication.

ZhangYuanhan-AI avatar Aug 04 '23 12:08 ZhangYuanhan-AI

@ZhangYuanhan-AI Wow! Thank you for the quick response. I will let you know how training goes and will of course cite your project in any resulting work.

codybum avatar Aug 04 '23 13:08 codybum

I am also interested in using multiple images for a single response. Could you expand upon how this works under the hood? Does it just concatenate the input images in the prompt? Or is there a special way it combines them?

LarsDoorenbos avatar Aug 11 '23 15:08 LarsDoorenbos

> I am also interested in using multiple images for a single response. Could you expand upon how this works under the hood? Does it just concatenate the input images in the prompt? Or is there a special way it combines them?

The Otter/Flamingo model's vision_x has shape [B, T, F, C, H, W], where T is the number of in-context examples and F is the number of frames (details can be found in the Flamingo paper).

If you need to train with multiple images, there are two scenarios (see the shape sketch after this list):

  • Regard them as frames with a sequential order. The model has time_embeddings to handle the frames and assign the sequential relationship to them. You then treat them like video input and organize your dataset like DC/TVC/E4D. In this way, your training prompt should be designed as <image>User: {instruction} GPT:<answer>. A single <image> token denotes the whole set of frames. In our SD subset, we treat it this way.

  • Regard them as in-context examples. You can refer to the LA_I2I/T2T part to organize your dataset. In this case, if you train with two images, the prompt should be <image><image>User: {instruction} GPT:<answer>.
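
Here is a toy PyTorch sketch (my illustration, not Otter code; the batch size, image count, and resolution are arbitrary) contrasting how the two scenarios pack N images into vision_x:

# Toy sketch contrasting the two packings of N images into
# vision_x of shape [B, T, F, C, H, W].
import torch

N, C, H, W = 4, 3, 224, 224
images = torch.randn(N, C, H, W)  # four images for one instruction

# Scenario 1: images as frames of one "video" (T=1, F=N).
# The model's time embeddings give the frames a sequential order;
# the prompt carries a single <image> token for the whole stack.
vision_x_frames = images.unsqueeze(0).unsqueeze(0)  # [1, 1, N, C, H, W]

# Scenario 2: images as in-context examples (T=N, F=1).
# Each image is its own example; the prompt carries one <image> per image.
vision_x_icl = images.unsqueeze(1).unsqueeze(0)     # [1, N, 1, C, H, W]

print(vision_x_frames.shape)  # torch.Size([1, 1, 4, 3, 224, 224])
print(vision_x_icl.shape)     # torch.Size([1, 4, 1, 3, 224, 224])

The only structural difference is which axis the images occupy: the frame axis gets time embeddings and a single <image> token, while the in-context axis gets one <image> token per image.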

Luodian avatar Aug 11 '23 16:08 Luodian

In scenario 1, can the model still do video frame localization?

helloword12345678 avatar Aug 22 '23 08:08 helloword12345678

> (quoting @ZhangYuanhan-AI's step-by-step reply from Aug 04 above)

Thanks for your work. Is there any related inference/evaluation code (where the input is multiple images plus a single prompt, in the SD format)?

xmc-andy avatar Aug 24 '23 10:08 xmc-andy

> (quoting @ZhangYuanhan-AI's step-by-step reply from Aug 04 above)

This is really helpful, but it seems the code has been updated and there is no process_spot_the_difference() function in that file anymore. Any ideas how to fine-tune the model to handle multiple images for a single response in the current version?

iz2late avatar Mar 13 '24 18:03 iz2late