Question on MobiLlama-V
Thanks for your great work! In the Multimodal MobiLlama part of the Results section, you briefly introduce how you developed MobiLlama-V. The model seems to have a LLaVA-like architecture but is trained only on the visual instruction tuning data, which may be why MobiLlama-V exhibits mediocre performance. Hence, my questions are the following:
- Can you release more details about the architecture and training process of MobiLlama-V?
- Did you, or will you, perform the full two-stage training (feature-alignment pretraining of the projector followed by visual instruction tuning) instead of only the second stage? See the sketch below for what I mean by the two stages.
- Have you considered using ALLaVA-4V, a high-quality multimodal dataset for vision-language training that was proposed to improve the performance of small VLMs?
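
For context, here is a minimal sketch of the LLaVA-style setup I have in mind when asking about the two stages; the module names, dimensions, and freezing scheme are my assumptions rather than the actual MobiLlama-V implementation:

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Minimal LLaVA-style sketch: vision encoder -> MLP projector -> small LM.
    Names and dimensions are illustrative assumptions, not the MobiLlama-V code."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a CLIP ViT, typically kept frozen
        self.projector = nn.Sequential(           # stage 1: only this projector is trained
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model      # e.g. a MobiLlama backbone, unfrozen in stage 2

    def forward(self, pixel_values, text_embeds):
        with torch.no_grad():
            image_feats = self.vision_encoder(pixel_values)  # (B, num_patches, vision_dim)
        image_tokens = self.projector(image_feats)           # (B, num_patches, llm_dim)
        # Prepend projected image tokens to the text embeddings and run the LM on the sequence.
        return self.language_model(torch.cat([image_tokens, text_embeds], dim=1))
```

In LLaVA, stage 1 (feature alignment) trains only the projector on image-caption pairs while the vision encoder and LM stay frozen, and stage 2 (visual instruction tuning) unfreezes the LM; my question is whether MobiLlama-V skips stage 1 entirely.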
Thanks!