
Question on MobiLlama-V

Open g-h-chen opened this issue 11 months ago • 0 comments

Thanks for your great work! In the Multimodal MobiLlama part of the Results section, you briefly introduce how you developed MobiLlama-V. The model appears to have a LLaVA-like architecture but seems to be trained only on the visual instruction tuning data, which may be why MobiLlama-V exhibits mediocre performance (I sketch my understanding of this setup below, after the questions). Hence, my questions are the following:

  1. Can you release more details about the architecture and training process of MobiLlama-V?
  2. Did/Will you perform two-stage training instead of only the second stage?
  3. Have you considered using ALLaVA-4V, a high-quality multimodal dataset for vision-language training? It was proposed specifically to improve the performance of small VLMs.
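
For context, here is a minimal sketch of what I assume a LLaVA-style hookup for MobiLlama-V looks like: a frozen vision encoder produces patch features, a small MLP projector maps them into the LLM embedding space, and the projected visual tokens are prepended to the text embeddings. The module names, the two-layer MLP design, and the dimensions (CLIP ViT-L/14-like 1024-d features, 2048-d LLM embeddings) are my assumptions, not details confirmed in the paper or code.

```python
# Hypothetical sketch of a LLaVA-style multimodal connector (not the authors' code).
import torch
import torch.nn as nn


class VisionToLLMProjector(nn.Module):
    """Two-layer MLP projector, in the style popularized by LLaVA-1.5."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


if __name__ == "__main__":
    # Assumed dimensions: 1024-d vision features, 2048-d MobiLlama embeddings.
    projector = VisionToLLMProjector(vision_dim=1024, llm_dim=2048)
    patch_features = torch.randn(1, 576, 1024)  # stand-in for frozen vision encoder output
    text_embeds = torch.randn(1, 32, 2048)      # stand-in for the prompt's token embeddings
    visual_tokens = projector(patch_features)
    # Visual tokens are concatenated before the text tokens and fed to the LLM.
    llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
    print(llm_input.shape)  # torch.Size([1, 608, 2048])
```

My question 2 above is essentially about this projector: in the usual two-stage recipe, stage 1 trains only the projector on image-caption pairs with the LLM frozen, and stage 2 unfreezes the LLM for visual instruction tuning.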

Thanks!

g-h-chen · Mar 04 '24 10:03