Feature Request: Split model over multiple Vulkan GPUs
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Related to #5259 (closed); if you prefer, I could move this discussion there.
How hard would it be to implement splitting a model over multiple Vulkan GPUs, as is already possible with the CUDA/HIP backends?
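To make the ask concrete, here is a rough sketch of what layer splitting looks like with the CUDA backend today, and what an equivalent Vulkan invocation might look like. The model path and split proportions are just examples, and it is only an assumption that the existing `--split-mode` / `--tensor-split` flags would carry over unchanged to a Vulkan build.

```sh
# Today: layer-split across two CUDA GPUs
# (model path and the 1,1 proportions are illustrative)
./main -m models/llama-2-13b.Q4_K_M.gguf -ngl 99 --split-mode layer --tensor-split 1,1 -p "Hello"

# The request: have a Vulkan build (e.g. built with LLAMA_VULKAN=1) honor the same
# --split-mode / --tensor-split flags, distributing layers across all GPUs exposed
# by the Vulkan loader instead of using only a single device.
```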
I guess OpenCL could be another path if Vulkan proves too difficult, since there is now a maturing rusticl driver that can be layered on top of Vulkan, in addition to various native drivers. It may not yet be mature enough to support llama.cpp, though that may be changing [1]. Also, as far as I know, mapping memory between GPUs in a multi-GPU configuration is still under active development there.
[1] https://archive.fosdem.org/2024/events/attachments/fosdem-2024-3364-why-not-run-opencl-accelerated-llm-on-your-phone-/slides/22383/Why_not_run_OpenCL-accelerated_LLM_on_your_phon_nK2DudB.pdf
Motivation
This would be really helpful: it is no longer unreasonable to want to ditch NVIDIA's proprietary drivers in favor of the open-source NVK Vulkan driver, and AMD cards are much better supported by Vulkan on the RADV driver than by AMD's spotty-to-nonexistent ROCm/HIP support. Vulkan is also more universally supported, so this could let someone split a model across, e.g., an AMD and an NVIDIA GPU if that is the hardware they have.
Possible Implementation
N/A