LLaVA-NeXT
                        [BUG] Function `prepare_inputs_labels_for_multimodal` flattens batch data
In the file `llava/model/llava_arch.py`, under the class `LlavaMetaForCausalLM`, there is a function `prepare_inputs_labels_for_multimodal` that is called by both the `generate` and `forward` functions.
In lines 411 and 412, the input embeds and labels are truncated to the tokenizer's maximum length:

```python
new_input_embeds = [x[:tokenizer_model_max_length] for x, modality in zip(new_input_embeds, modalities)]
new_labels = [x[:tokenizer_model_max_length] for x, modality in zip(new_labels, modalities)]
```
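For context, Python's built-in `zip` stops at the shortest of its arguments, so a length-1 `modalities` list silently truncates the whole batch. A minimal standalone sketch of that behavior (the variable contents here are illustrative, not taken from the repo):

```python
# zip stops at the shortest iterable: only the first batch entry survives
batch = ["embeds_0", "embeds_1", "embeds_2"]  # pretend batch of 3 sequences
modalities = ["images"]                        # single modality entry

kept = [x for x, modality in zip(batch, modalities)]
print(kept)  # ['embeds_0'] -- the other two entries are dropped
```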
This is exactly what happens when I run with images: `modalities` is simply the list `["images"]`, so if there are multiple inputs in `new_input_embeds`, all but the first are dropped. Removing `modalities` from these lines fixes the issue for me.
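For concreteness, this is roughly the change I applied locally; it keeps the truncation but iterates over the full batch (a sketch of the workaround, not a reviewed patch):

```python
# Truncate every sequence in the batch, independent of len(modalities)
new_input_embeds = [x[:tokenizer_model_max_length] for x in new_input_embeds]
new_labels = [x[:tokenizer_model_max_length] for x in new_labels]
```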