Inquiry about MoE (Mixture of Experts) Training Support
Hello VILA team!
First, thank you for open-sourcing this incredible family of Vision Language Models! The work on VILA and NVILA is truly impressive, and the focus on efficiency and deployment is particularly valuable for the community.
I have been exploring the codebase and documentation with great interest. My question is regarding the future development roadmap: are there any plans to support training VILA models with a Mixture-of-Experts (MoE) architecture (such as the Qwen3-MoE or DeepSeek-MoE models)?
Integrating MoE could be a powerful way to further scale model capacity while keeping inference cost low, since only a small subset of experts is activated per token, which aligns well with the project's focus on efficiency. This would be especially exciting for handling even more complex multi-image and long-video understanding tasks.
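For concreteness, here is a minimal toy sketch of the kind of top-k-routed MoE feed-forward block I have in mind, written in plain PyTorch. This is only an illustration of the general idea, not anything from the VILA codebase or the Qwen/DeepSeek implementations; all class names, parameter names, and dimensions are placeholders I made up for the example.

```python
# Toy sparse MoE feed-forward layer with top-k routing (conceptual sketch only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoEFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each token for each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = self.router(x)                          # (batch, seq, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out  # each token pays for top_k experts, not all num_experts


x = torch.randn(2, 16, 512)
print(ToyMoEFFN()(x).shape)  # torch.Size([2, 16, 512])
```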
I would be very interested to know if this is a direction you are considering.
Yes, we are working on MoE model support and will roll out a version soon.