
Does distributed-llama currently support multimodal models?

SherronBurtint opened this issue 7 months ago · 3 comments

Does distributed-llama currently support multimodal models? For example, LLaVA.

I tried it and found that the model can run, but I can't run inference on images.

Also, do you need testing on edge-node devices? We have many idle edge nodes and can provide assistance and support.

SherronBurtint · May 23 '25 01:05

Same question here.

mvsoom · Jun 07 '25 09:06

I'd also like to know this, but I'm curious as to how you managed to run a LLaVA model under distributed-llama, @SherronBurtint — would you be able to share?

cjastone · Jul 13 '25 08:07

@cjastone I tried LLaVA based on LLaMA 3, and it can indeed be converted into a .m model and run. However, I believe the conversion only covers the pure language-model layers of LLaMA 3, essentially ignoring the vision encoder part (CLIP). So at the moment, multimodal support is not possible.
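
If you want to verify which parts of a LLaVA checkpoint a LLaMA-only converter actually sees, here is a minimal sketch (not from the distributed-llama codebase) that splits a checkpoint's tensor names into language-model and vision-encoder groups. It assumes a Hugging Face safetensors shard and the transformers LLaVA naming layout (the `language_model.`, `vision_tower.`, and `multi_modal_projector.` prefixes); the filename is a placeholder for your own shard.

```python
# Sketch: list which tensors in a LLaVA checkpoint belong to the
# language model vs. the vision encoder / projector.
# Assumption: Hugging Face safetensors format with the transformers
# LLaVA prefix layout; adjust the prefixes for other layouts.
from safetensors import safe_open

LANG_PREFIX = "language_model."
VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def split_tensors(path: str):
    lang, vision, other = [], [], []
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            if name.startswith(LANG_PREFIX):
                lang.append(name)
            elif name.startswith(VISION_PREFIXES):
                vision.append(name)
            else:
                other.append(name)
    return lang, vision, other

if __name__ == "__main__":
    # Placeholder filename — point this at a real LLaVA shard.
    lang, vision, other = split_tensors("model-00001-of-00004.safetensors")
    print(f"language-model tensors:   {len(lang)}")    # what a LLaMA-only converter keeps
    print(f"vision/projector tensors: {len(vision)}")  # what it would silently drop
    print(f"unclassified tensors:     {len(other)}")
```

On a standard LLaVA checkpoint the vision/projector group is non-empty, and those are exactly the weights a pure-LLaMA conversion has no target for, which matches the behavior above: the model runs, but image inputs have nothing to feed into.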

StrangeZuo · Aug 20 '25 03:08