
Suggestion to cite prior work mPLUG-Owl3 and other related cross-attention-based efficient MLLMs

Open LukeForeverYoung opened this issue 10 months ago • 4 comments

Excellent work on advancing efficient MLLMs. I observe that LLaVA-Mini employs transformer layers initialized from the language model for multimodal fusion through cross-attention mechanisms, which shares similarities with prior works such as mPLUG-Owl3, which repurposes the transformer layers within the language model to execute both cross-attention and self-attention operations in parallel. To strengthen the contextual foundation of efficient MLLM research, we suggest adding related cross-attention architectures to your references. Specifically, foundational works like Flamingo, EVLM, and LLaMA-Vision could be cited to better situate your work within the landscape of efficient MLLM development.

[mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models](https://arxiv.org/abs/2408.04840)
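
For readers less familiar with this family of designs, here is a minimal, purely illustrative sketch (not the actual LLaVA-Mini or mPLUG-Owl3 code) of a transformer block that runs self-attention over text tokens and cross-attention to compressed visual tokens in parallel, then fuses the two paths:

```python
# Illustrative sketch only -- not the actual LLaVA-Mini or mPLUG-Owl3 implementation.
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_vis = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # Self-attention over the text/query tokens.
        t = self.norm_text(text)
        self_out, _ = self.self_attn(t, t, t)
        # Cross-attention: text tokens attend to the (compressed) visual tokens.
        v = self.norm_vis(vision)
        cross_out, _ = self.cross_attn(t, v, v)
        # Parallel fusion of both attention paths, followed by a feed-forward layer.
        fused = text + self_out + cross_out
        return fused + self.mlp(fused)


if __name__ == "__main__":
    text = torch.randn(2, 32, 512)    # batch of 32 text tokens
    vision = torch.randn(2, 16, 512)  # 16 compressed visual tokens
    out = FusionBlock()(text, vision)
    print(out.shape)  # torch.Size([2, 32, 512])
```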

LukeForeverYoung avatar Feb 08 '25 11:02 LukeForeverYoung

Many works use this kind of design, but it adds a lot of parameters. With that much additional computation, I am not sure whether using fewer tokens is still meaningful.
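
To make the tradeoff concrete, here is a rough back-of-the-envelope sketch (using the common ~2 × parameters × tokens approximation for a dense transformer forward pass; all parameter and token counts below are hypothetical, not measured from any of these models):

```python
# Rough, illustrative FLOPs estimate only -- hypothetical numbers.
def forward_flops(params_billion: float, num_tokens: int) -> float:
    """Approximate forward-pass cost (in TFLOPs) for a dense transformer."""
    return 2 * params_billion * 1e9 * num_tokens / 1e12


# Baseline: a 7B LLM processing 576 visual tokens plus 64 text tokens.
baseline = forward_flops(7.0, 576 + 64)

# Token-compressed variant: extra fusion layers add ~0.3B parameters
# (hypothetical), but the LLM now sees only 1 visual token plus 64 text tokens.
compressed = forward_flops(7.0 + 0.3, 1 + 64)

print(f"baseline:   {baseline:.2f} TFLOPs")   # ~8.96 TFLOPs
print(f"compressed: {compressed:.2f} TFLOPs") # ~0.95 TFLOPs
```

Under these assumptions the token reduction dominates the extra fusion parameters, but the balance clearly depends on how large the added modules are and how many tokens the baseline actually processes.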

MonolithFoundation avatar Feb 11 '25 09:02 MonolithFoundation

@MonolithFoundation Do you mean LLaVA-Mini or mPLUG-Owl3?

MiloQ avatar Feb 12 '25 08:02 MiloQ

Anything with a resampler.

MonolithFoundation avatar Feb 12 '25 11:02 MonolithFoundation

@MonolithFoundation Then how do you explain the speed gains reported in these papers, e.g., in terms of FLOPs?

MiloQ avatar Feb 13 '25 02:02 MiloQ