Suggestion to cite prior work mPLUG-Owl3 and other related cross-attention-based efficient MLLMs
Excellent work on advancing efficient MLLMs. I observe that LLaVA-Mini employs transformer layers initialized from the language model for multimodal fusion through cross-attention, which shares similarities with prior works such as mPLUG-Owl3, where transformer layers within the language model are repurposed to execute cross-attention and self-attention in parallel. To strengthen the contextual foundation of efficient MLLM research, we suggest adding related cross-attention architectures to your references. Specifically, foundational works such as Flamingo, EVLM, and LLaMA-Vision could be cited to better situate your work within the landscape of efficient MLLM development.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (https://arxiv.org/abs/2408.04840)
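To make the parallel self-/cross-attention fusion pattern mentioned above concrete, here is a minimal PyTorch sketch of a decoder block that runs text self-attention and text-to-vision cross-attention side by side and mixes them with a learned gate. This is only an illustration under assumed dimensions; the module names, the shared-weight choice, and the gating are hypothetical and not taken from either paper's actual implementation.

```python
import torch
import torch.nn as nn


class ParallelFusionBlock(nn.Module):
    """Decoder block that runs text self-attention and text-to-vision
    cross-attention in parallel with shared attention weights, then mixes
    the two paths with a learned gate (simplified illustration only)."""

    def __init__(self, d_model: int = 4096, n_heads: int = 32):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # A single attention module reused for both paths, mimicking the idea
        # of repurposing existing language-model layers instead of stacking a
        # separate cross-attention tower on top.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(d_model))  # learned per-channel mixing gate
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        h = self.norm1(text)
        self_out, _ = self.attn(h, h, h)             # text self-attention
        cross_out, _ = self.attn(h, vision, vision)  # text -> vision cross-attention
        # Gated sum of the two parallel paths, followed by a standard FFN.
        fused = text + self_out + torch.tanh(self.gate) * cross_out
        return fused + self.ffn(self.norm2(fused))


# Toy usage: 1 sample, 16 text tokens attending over 64 compressed visual tokens.
block = ParallelFusionBlock(d_model=256, n_heads=8)
out = block(torch.randn(1, 16, 256), torch.randn(1, 64, 256))
print(out.shape)  # torch.Size([1, 16, 256])
```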
Many works use this kind of design, but it adds too many parameters. With so much additional computation introduced, I am not sure whether reducing the number of tokens is still meaningful.
@MonolithFoundation Do you mean LLaVA-Mini or mPLUG-Owl3?
Anything with a Resampler
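For context, a "resampler" here usually means a small module of learned query tokens that cross-attend to the full set of visual features and return a much shorter sequence (in the spirit of Flamingo's Perceiver Resampler or Q-Former-style abstractors). The sketch below is a generic, assumed implementation rather than any specific model's code; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn


class VisualResampler(nn.Module):
    """Generic resampler sketch: learned queries compress many visual tokens
    into a short fixed-length sequence via cross-attention (illustrative only)."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16, n_queries: int = 64):
        super().__init__()
        # These learned queries (plus the attention / FFN weights below) are
        # the extra parameters the comment above is concerned about.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_patches, d_model), e.g. n_patches = 576
        b = visual_tokens.size(0)
        q = self.norm_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(visual_tokens)
        out, _ = self.attn(q, kv, kv)  # compress n_patches tokens -> n_queries tokens
        return out + self.ffn(out)


resampler = VisualResampler(d_model=256, n_heads=8, n_queries=64)
compressed = resampler(torch.randn(2, 576, 256))
print(compressed.shape)  # torch.Size([2, 64, 256])
```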
@MonolithFoundation How would you explain the speed gains reported in these papers, e.g., in terms of FLOPs?
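One common way such papers argue the speed gain is a FLOPs estimate for the LLM prefill: both the weight matmuls and the attention scores scale with the number of input tokens, so shrinking 576 visual tokens to, say, 64 reduces prefill compute even after paying for an extra resampler or cross-attention module. The back-of-envelope below uses assumed 7B-class numbers and a hypothetical 0.5B-parameter fusion module purely for illustration; it does not reproduce any figure from the papers under discussion.

```python
# Rough prefill-FLOPs estimate for a decoder-only LLM (not a benchmark):
#   FLOPs ~= 2 * n_params * seq_len  +  4 * n_layers * d_model * seq_len**2
# The first term covers the weight matmuls, the second the QK^T and attn @ V
# products. All constants below are illustrative placeholders.

def prefill_flops(seq_len: int, n_params: float, n_layers: int, d_model: int) -> float:
    linear = 2 * n_params * seq_len                     # matmuls with the model weights
    attention = 4 * n_layers * d_model * seq_len ** 2   # QK^T and attn @ V
    return linear + attention


LLM_PARAMS, LAYERS, D_MODEL = 7e9, 32, 4096   # assumed 7B-class LLM
TEXT_TOKENS = 64

# Baseline: all 576 visual tokens go through the LLM alongside the text.
baseline = prefill_flops(TEXT_TOKENS + 576, LLM_PARAMS, LAYERS, D_MODEL)

# Compressed: only 64 visual tokens enter the LLM, plus one pass of a
# hypothetical 0.5B-parameter resampler / fusion module over the 576 patches.
compressed = (
    prefill_flops(TEXT_TOKENS + 64, LLM_PARAMS, LAYERS, D_MODEL)
    + 2 * 0.5e9 * 576
)

print(f"baseline   : {baseline / 1e12:.1f} TFLOPs")
print(f"compressed : {compressed / 1e12:.1f} TFLOPs")
print(f"ratio      : {baseline / compressed:.1f}x fewer prefill FLOPs")
```

Note that a FLOPs ratio like this does not translate one-to-one into wall-clock latency, since memory bandwidth, KV-cache size during decoding, and the vision encoder's own cost also matter.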