Question on MobiLlama-V
Thanks for your great work! In the Multimodal MobiLlama part of the Results section, you briefly introduce how you developed MobiLlama-V. The model seems to have a LLaVA-like architecture but is trained only on the visual instruction tuning data, which may be why MobiLlama-V exhibits mediocre performance. Hence, my questions are the following:
- Can you release more details about the architecture and training process of MobiLlama-V?
- Did you, or will you, perform the full two-stage training (feature-alignment pretraining of the projector followed by visual instruction tuning) instead of only the second stage? See the sketch below for what I mean by the two stages.
- Have you considered using ALLaVA-4V, a high-quality multimodal dataset for vision-language training that was proposed to improve the performance of small VLMs?
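
For context, here is a minimal sketch of the LLaVA-style setup I have in mind when asking about the two stages; the module names, dimensions, and freezing scheme are my assumptions rather than the actual MobiLlama-V implementation:

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Minimal LLaVA-style sketch: vision encoder -> MLP projector -> small LM.
    Names and dimensions are illustrative assumptions, not the MobiLlama-V code."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a CLIP ViT, typically kept frozen
        self.projector = nn.Sequential(           # stage 1: only this projector is trained
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model      # e.g. a MobiLlama backbone, unfrozen in stage 2

    def forward(self, pixel_values, text_embeds):
        with torch.no_grad():
            image_feats = self.vision_encoder(pixel_values)  # (B, num_patches, vision_dim)
        image_tokens = self.projector(image_feats)           # (B, num_patches, llm_dim)
        # Prepend projected image tokens to the text embeddings and run the LM on the sequence.
        return self.language_model(torch.cat([image_tokens, text_embeds], dim=1))
```

In LLaVA, stage 1 (feature alignment) trains only the projector on image-caption pairs while the vision encoder and LM stay frozen, and stage 2 (visual instruction tuning) unfreezes the LM; my question is whether MobiLlama-V skips stage 1 entirely.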
Thanks!