
Does distributed-llama currently support multimodal models?

SherronBurtint opened this issue 7 months ago · 3 comments

Does distributed-llama currently support multimodal models? For example, LLaVA.

I tried it and found that the model can run, but I can't run inference on images.

Also, do you need testing on edge-node devices? We have many idle edge nodes and can provide assistance and support.

SherronBurtint · May 23 '25 01:05

Same question here.

mvsoom · Jun 07 '25 09:06

I'd also like to know this, but I'm curious as to how you managed to run a LLaVA model under distributed-llama, @SherronBurtint — would you be able to share?

cjastone · Jul 13 '25 08:07

@cjastone I tried LLaVA based on LLaMA 3, and it can indeed be converted into a .m model and run. However, I believe the conversion only covers the pure language-model layers of LLaMA 3, essentially ignoring the vision encoder part (CLIP). So at the moment, multimodal support is not possible.
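
If you want to verify which parts of a LLaVA checkpoint a LLaMA-only converter actually sees, here is a minimal sketch (not from the distributed-llama codebase) that splits a checkpoint's tensor names into language-model and vision-encoder groups. It assumes a Hugging Face safetensors shard and the transformers LLaVA naming layout (the `language_model.`, `vision_tower.`, and `multi_modal_projector.` prefixes); the filename is a placeholder for your own shard.

```python
# Sketch: list which tensors in a LLaVA checkpoint belong to the
# language model vs. the vision encoder / projector.
# Assumption: Hugging Face safetensors format with the transformers
# LLaVA prefix layout; adjust the prefixes for other layouts.
from safetensors import safe_open

LANG_PREFIX = "language_model."
VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def split_tensors(path: str):
    lang, vision, other = [], [], []
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            if name.startswith(LANG_PREFIX):
                lang.append(name)
            elif name.startswith(VISION_PREFIXES):
                vision.append(name)
            else:
                other.append(name)
    return lang, vision, other

if __name__ == "__main__":
    # Placeholder filename — point this at a real LLaVA shard.
    lang, vision, other = split_tensors("model-00001-of-00004.safetensors")
    print(f"language-model tensors:   {len(lang)}")    # what a LLaMA-only converter keeps
    print(f"vision/projector tensors: {len(vision)}")  # what it would silently drop
    print(f"unclassified tensors:     {len(other)}")
```

On a standard LLaVA checkpoint the vision/projector group is non-empty, and those are exactly the weights a pure-LLaMA conversion has no target for, which matches the behavior above: the model runs, but image inputs have nothing to feed into.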

StrangeZuo · Aug 20 '25 03:08