Suggestion to cite prior work mPLUG-Owl3 and other related cross-attention-based efficient MLLMs
Excellent work on advancing efficient MLLMs. I observe that LLaVA-Mini employs transformer layers initialized from the language model for multimodal fusion through cross-attention, which shares similarities with prior works such as mPLUG-Owl3, where transformer layers within the language model are repurposed to execute cross-attention and self-attention in parallel. To strengthen the contextual foundation of efficient MLLM research, we suggest adding related cross-attention architectures to your references. Specifically, foundational works such as Flamingo, EVLM, and LLaMA-Vision could be cited to better situate your work within the landscape of efficient MLLM development.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (https://arxiv.org/abs/2408.04840)
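To make the parallel self-/cross-attention fusion pattern mentioned above concrete, here is a minimal PyTorch sketch of a decoder block that runs text self-attention and text-to-vision cross-attention side by side and mixes them with a learned gate. This is only an illustration under assumed dimensions; the module names, the shared-weight choice, and the gating are hypothetical and not taken from either paper's actual implementation.

```python
import torch
import torch.nn as nn


class ParallelFusionBlock(nn.Module):
    """Decoder block that runs text self-attention and text-to-vision
    cross-attention in parallel with shared attention weights, then mixes
    the two paths with a learned gate (simplified illustration only)."""

    def __init__(self, d_model: int = 4096, n_heads: int = 32):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # A single attention module reused for both paths, mimicking the idea
        # of repurposing existing language-model layers instead of stacking a
        # separate cross-attention tower on top.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(d_model))  # learned per-channel mixing gate
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        h = self.norm1(text)
        self_out, _ = self.attn(h, h, h)             # text self-attention
        cross_out, _ = self.attn(h, vision, vision)  # text -> vision cross-attention
        # Gated sum of the two parallel paths, followed by a standard FFN.
        fused = text + self_out + torch.tanh(self.gate) * cross_out
        return fused + self.ffn(self.norm2(fused))


# Toy usage: 1 sample, 16 text tokens attending over 64 compressed visual tokens.
block = ParallelFusionBlock(d_model=256, n_heads=8)
out = block(torch.randn(1, 16, 256), torch.randn(1, 64, 256))
print(out.shape)  # torch.Size([1, 16, 256])
```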
Many works use this kind of design, but it adds too many parameters. With so much additional computation introduced, I am not sure whether reducing the number of tokens is still meaningful.
@MonolithFoundation Do you mean LLaVA-Mini or mPLUG-Owl3?
Anything with a Resampler
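For context, a "resampler" here usually means a small module of learned query tokens that cross-attend to the full set of visual features and return a much shorter sequence (in the spirit of Flamingo's Perceiver Resampler or Q-Former-style abstractors). The sketch below is a generic, assumed implementation rather than any specific model's code; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn


class VisualResampler(nn.Module):
    """Generic resampler sketch: learned queries compress many visual tokens
    into a short fixed-length sequence via cross-attention (illustrative only)."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16, n_queries: int = 64):
        super().__init__()
        # These learned queries (plus the attention / FFN weights below) are
        # the extra parameters the comment above is concerned about.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_patches, d_model), e.g. n_patches = 576
        b = visual_tokens.size(0)
        q = self.norm_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(visual_tokens)
        out, _ = self.attn(q, kv, kv)  # compress n_patches tokens -> n_queries tokens
        return out + self.ffn(out)


resampler = VisualResampler(d_model=256, n_heads=8, n_queries=64)
compressed = resampler(torch.randn(2, 576, 256))
print(compressed.shape)  # torch.Size([2, 64, 256])
```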
@MonolithFoundation How would you explain the speed gains reported in these papers, e.g., in terms of FLOPs?
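One common way such papers argue the speed gain is a FLOPs estimate for the LLM prefill: both the weight matmuls and the attention scores scale with the number of input tokens, so shrinking 576 visual tokens to, say, 64 reduces prefill compute even after paying for an extra resampler or cross-attention module. The back-of-envelope below uses assumed 7B-class numbers and a hypothetical 0.5B-parameter fusion module purely for illustration; it does not reproduce any figure from the papers under discussion.

```python
# Rough prefill-FLOPs estimate for a decoder-only LLM (not a benchmark):
#   FLOPs ~= 2 * n_params * seq_len  +  4 * n_layers * d_model * seq_len**2
# The first term covers the weight matmuls, the second the QK^T and attn @ V
# products. All constants below are illustrative placeholders.

def prefill_flops(seq_len: int, n_params: float, n_layers: int, d_model: int) -> float:
    linear = 2 * n_params * seq_len                     # matmuls with the model weights
    attention = 4 * n_layers * d_model * seq_len ** 2   # QK^T and attn @ V
    return linear + attention


LLM_PARAMS, LAYERS, D_MODEL = 7e9, 32, 4096   # assumed 7B-class LLM
TEXT_TOKENS = 64

# Baseline: all 576 visual tokens go through the LLM alongside the text.
baseline = prefill_flops(TEXT_TOKENS + 576, LLM_PARAMS, LAYERS, D_MODEL)

# Compressed: only 64 visual tokens enter the LLM, plus one pass of a
# hypothetical 0.5B-parameter resampler / fusion module over the 576 patches.
compressed = (
    prefill_flops(TEXT_TOKENS + 64, LLM_PARAMS, LAYERS, D_MODEL)
    + 2 * 0.5e9 * 576
)

print(f"baseline   : {baseline / 1e12:.1f} TFLOPs")
print(f"compressed : {compressed / 1e12:.1f} TFLOPs")
print(f"ratio      : {baseline / compressed:.1f}x fewer prefill FLOPs")
```

Note that a FLOPs ratio like this does not translate one-to-one into wall-clock latency, since memory bandwidth, KV-cache size during decoding, and the vision encoder's own cost also matter.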