[Draft] Tensor Parallel support for llama.cpp
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High

Add tensor parallel support to llama.cpp; this is still draft code.
Refer to https://github.com/ggerganov/llama.cpp/issues/9086 for the detailed design.
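For readers unfamiliar with the approach, below is a minimal conceptual sketch of the column-parallel matmul split that tensor parallelism is built on. This is plain, single-process C++ for illustration only, not the PR's actual code; the device count, the slicing scheme, and the final gather step are assumptions standing in for the real multi-GPU split and the oneCCL collective described in the linked issue.

```cpp
// Conceptual sketch of column-parallel tensor parallelism (NOT the PR's code).
// The weight matrix W [K x N] is split by columns across "devices"; each device
// computes a partial output y_i = x * W_i, and the full output is the
// concatenation of all partials (an all-gather in a real multi-GPU setup).
#include <cstdio>
#include <vector>

// Dense matvec: y[N] = x[K] * W[K x N], with W stored row-major.
static std::vector<float> matmul(const std::vector<float>& x,
                                 const std::vector<float>& W,
                                 int K, int N) {
    std::vector<float> y(N, 0.0f);
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n)
            y[n] += x[k] * W[k * N + n];
    return y;
}

int main() {
    const int K = 4, N = 6, n_dev = 2;      // hypothetical sizes / device count
    std::vector<float> x(K), W(K * N);
    for (int i = 0; i < K; ++i)      x[i] = 1.0f + i;
    for (int i = 0; i < K * N; ++i)  W[i] = 0.1f * i;

    // Reference: full matmul on a single device.
    std::vector<float> y_ref = matmul(x, W, K, N);

    // Tensor-parallel: each device holds one column slice W_i of shape [K x N/n_dev].
    const int N_loc = N / n_dev;
    std::vector<float> y_tp(N);
    for (int dev = 0; dev < n_dev; ++dev) {
        std::vector<float> W_loc(K * N_loc);
        for (int k = 0; k < K; ++k)
            for (int n = 0; n < N_loc; ++n)
                W_loc[k * N_loc + n] = W[k * N + dev * N_loc + n];
        std::vector<float> y_loc = matmul(x, W_loc, K, N_loc);
        // Stand-in for the all-gather: concatenate partial outputs into the full result.
        for (int n = 0; n < N_loc; ++n)
            y_tp[dev * N_loc + n] = y_loc[n];
    }

    // The concatenated tensor-parallel result matches the single-device one.
    for (int n = 0; n < N; ++n)
        std::printf("y_ref=%6.2f  y_tp=%6.2f\n", y_ref[n], y_tp[n]);
    return 0;
}
```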
@ClarkChin08 It's great to see this feature implemented.
Is it possible to update the guide/docs to explain how to use this feature:
- how to enable it.
- what the benefits are.
- which use cases should use this feature.
- how to install the dependent packages (oneCCL, MPI) from oneAPI.
Thank you!
Hello - was this feature completed?
@ClarkChin08, hello - was this feature completed?
Hi, thanks, and I appreciate the work. It would be great to have this feature added/completed; it would bring great performance for multi-GPU setups, similar to what vLLM already has.
This looks really interesting! Having TP support like vLLM does would bring some great speedups!
Looking forward to having this feature.
Just a bump, this feature would be really great for the community.
I suspect the OP has abandoned development and this feature is incomplete.
There's actually much more recent progress on this in https://github.com/ggml-org/llama.cpp/pull/13818 and https://github.com/ggml-org/llama.cpp/pull/13776, but it's not ready yet.