
[BUG] Function `prepare_inputs_labels_for_multimodal` flattens batch data

Open guyazran opened this issue 1 year ago • 2 comments

In the file `llava/model/llava_arch.py`, the class `LlavaMetaForCausalLM` defines a function `prepare_inputs_labels_for_multimodal` that is called from both `generate` and `forward`. In lines 411 and 412, the input embeds change shape:

```python
new_input_embeds = [x[:tokenizer_model_max_length] for x, modality in zip(new_input_embeds, modalities)]
new_labels = [x[:tokenizer_model_max_length] for x, modality in zip(new_labels, modalities)]
```

When I run with images, `modalities` is simply the list `["images"]`, so if there are multiple inputs in `new_input_embeds`, `zip` stops after the first pair and the remaining batch entries are silently dropped. Removing `modalities` (and the `zip`) from these lines fixes the issue for me.
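Here is a minimal standalone sketch of the truncation behavior (the tensor contents are made-up placeholders, not actual LLaVA embeddings):

```python
# Simulate a batch of 3 per-sample embedding sequences alongside a
# modalities list with a single entry, as happens when running with images.
new_input_embeds = [[1, 2, 3, 4], [5, 6, 7], [8, 9]]
modalities = ["images"]
tokenizer_model_max_length = 2

# zip() stops at the shortest iterable, so only the first sample survives
# and the batch is flattened down to one element.
truncated = [x[:tokenizer_model_max_length]
             for x, modality in zip(new_input_embeds, modalities)]
print(truncated)  # [[1, 2]]

# Dropping modalities from the comprehension preserves the whole batch.
fixed = [x[:tokenizer_model_max_length] for x in new_input_embeds]
print(fixed)  # [[1, 2], [5, 6], [8, 9]]
```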

guyazran · Aug 15 '24 14:08