Feature Request: Tensor parallelism (--split-mode row) over RPC
Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Implement tensor parallelism over RPC. At the moment, setting --split-mode row has no effect when used with the RPC server.
Could you provide me with a rough outline of how I would best go about it?
What steps would I have to take to extend the functionality of the RPC server?
Motivation
I love your project, it's everything I was looking for. You guys are true heroes, the antidote to Nvidia's corporate greed.
I am running two Tesla P100s at home on old gaming mainboards, connected via an InfiniBand NIC in Ethernet mode. The NIC is dirt cheap, as is the Tesla P100. If we can get this to work, you can easily run 8B models at 60+ tps with just two cards.
This will unlock the full potential of homelabs and smaller enterprises.
Love you guys
Possible Implementation
I just started looking into it and found your implementation for row splitting on a single host:
```cpp
if (split_mode == LLAMA_SPLIT_MODE_ROW) {
    ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
    auto ggml_backend_split_buffer_type_fn = (ggml_backend_split_buffer_type_t)
        ggml_backend_reg_get_proc_address(reg, "ggml_backend_split_buffer_type");
    if (ggml_backend_split_buffer_type_fn) {
        size_t dev_index = [&]() {
            auto * reg = ggml_backend_dev_backend_reg(dev);
            for (size_t i = 0; i < ggml_backend_reg_dev_count(reg); ++i) {
                if (ggml_backend_reg_dev_get(reg, i) == dev) {
                    return i;
                }
            }
            throw std::runtime_error(format("device %s not found in its backend reg", ggml_backend_dev_name(dev)));
        }();
        auto * buft = ggml_backend_split_buffer_type_fn(dev_index, tensor_split);
        if (buft != nullptr) {
            buft_list.emplace_back(dev, buft);
        }
    }
}
```
I want to distribute the splits via RPC to different hosts for computation. What files/folders would I need to look at? I am asking for some general guidance.
Currently tensor parallelism is implemented entirely in the backends, in a way that is more or less a hack. The best way to do this would be to implement it at a higher level, using only ggml-backend APIs, so that it can work with any backend or combination of backends. The event primitives in the ggml-backend interface should be enough to handle synchronization.
One possible way to implement this would be to create a new type of virtual backend that takes a list of backends during initialization and automatically distributes the tensors and the operations to all of them in parallel. This would be quite a complex task, and you shouldn't expect to be able to implement it without first spending time learning the details of ggml-backend. I cannot give you a full guide on how to do this, but I can answer specific questions if there is something that you don't understand.
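To make the idea concrete, here is a purely illustrative caller-side sketch of how such a virtual backend could be created. `ggml_backend_parallel_init` is hypothetical and does not exist in ggml today; `ggml_backend_cuda_init` and `ggml_backend_rpc_init` are the existing initializers for the CUDA and RPC backends, and the endpoints are made up.

```cpp
// Purely illustrative: a hypothetical "parallel" virtual backend seen from the caller's side.
// ggml_backend_parallel_init() does NOT exist in ggml; it stands in for the proposed backend
// that takes a list of backends and distributes tensors and ops across them.
#include "ggml-backend.h"
#include "ggml-cuda.h" // ggml_backend_cuda_init
#include "ggml-rpc.h"  // ggml_backend_rpc_init

int main() {
    ggml_backend_t children[3];
    children[0] = ggml_backend_cuda_init(0);                    // local GPU
    children[1] = ggml_backend_rpc_init("192.168.1.10:50052");  // remote RPC server 1
    children[2] = ggml_backend_rpc_init("192.168.1.11:50052");  // remote RPC server 2

    // hypothetical API: wrap the children in a single backend that splits tensors and
    // operations across them and synchronizes internally with ggml-backend events
    ggml_backend_t parallel = ggml_backend_parallel_init(children, 3);

    // ... use `parallel` like any other ggml_backend_t, then free it ...
    ggml_backend_free(parallel);
    return 0;
}
```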
What do you mean by hack? Having it at the backend level looks good to my novice eyes; that's how I would try to implement it for the RPC backend as well.
Is there any chance of working more closely with you guys? I think you hold the keys to making state-of-the-art models available for everyone. With this feature, almost anyone with just a few thousand dollars could serve 70B+ parameter models in the 50+ tokens per second range.
I'd love to invest my time and energy into this; I hate these big snobbish corpos that want to gatekeep this technology from everyone else. Could we maybe do 1-2 pairing sessions to help me get started?
Implementing it via a virtual backend that wraps other backends would be more flexible. It would allow any combination of backends; for example, the local CPU or GPU could also be used for tensor parallelism together with the remote RPC servers.
I suggest you get started in this way:
- Start with the BLAS backend; it is the simplest backend and only supports matrix multiplication, which is exactly what you want
- Modify it to forward the matrix multiplication operation to a different backend instead of using the BLAS library
- Next, add support for tiled matrix multiplication using a different backend for each tile
This should give you a good starting point. Next, you will need to implement a buffer type that wraps the buffer types of the other backends and automatically distributes the tiles of the matrices to each backend. After that, it may be usable with very big models, in which the overhead of the synchronization is relatively small, but it will likely need a lot more optimization.
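As a rough illustration of the tiling step (not llama.cpp code, and far from optimized), the sketch below splits the weight matrix into two row tiles, multiplies each tile on its own backend, and concatenates the partial results on the host. Two CPU backends stand in for the eventual CUDA/RPC backends, and the simplest possible synchronization (async compute followed by a full synchronize) replaces the event primitives a real implementation would use.

```cpp
// Rough sketch: row-tiled matrix multiplication across two ggml backends.
// Two CPU backends stand in for remote RPC / CUDA backends; sizes are toy values.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h" // ggml_backend_cpu_init (declared in ggml-backend.h on older ggml versions)

#include <algorithm>
#include <vector>

int main() {
    const int64_t n_in = 64, n_out = 128, n_tok = 8; // W is n_out x n_in, x is n_in x n_tok
    const int64_t n_out_half = n_out / 2;

    ggml_backend_t backend[2] = { ggml_backend_cpu_init(), ggml_backend_cpu_init() };

    struct ggml_context * ctx[2];
    struct ggml_cgraph  * graph[2];
    struct ggml_tensor  * w[2], * x[2], * y[2];
    ggml_backend_buffer_t buf[2];

    for (int i = 0; i < 2; ++i) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ ggml_tensor_overhead()*8 + ggml_graph_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true, // tensor data will live in backend buffers
        };
        ctx[i] = ggml_init(params);

        // each backend holds half of the rows of W and a full copy of the activations x
        w[i] = ggml_new_tensor_2d(ctx[i], GGML_TYPE_F32, n_in, n_out_half);
        x[i] = ggml_new_tensor_2d(ctx[i], GGML_TYPE_F32, n_in, n_tok);
        y[i] = ggml_mul_mat(ctx[i], w[i], x[i]); // partial result: n_out_half x n_tok

        graph[i] = ggml_new_graph(ctx[i]);
        ggml_build_forward_expand(graph[i], y[i]);

        buf[i] = ggml_backend_alloc_ctx_tensors(ctx[i], backend[i]);
    }

    // upload the weight tile and the shared activations for each backend, e.g.:
    // ggml_backend_tensor_set(w[i], w_rows_of_tile_i, 0, ggml_nbytes(w[i]));
    // ggml_backend_tensor_set(x[i], x_data,           0, ggml_nbytes(x[i]));

    // launch both tiles, then wait for both; events could overlap this more cleverly
    for (int i = 0; i < 2; ++i) ggml_backend_graph_compute_async(backend[i], graph[i]);
    for (int i = 0; i < 2; ++i) ggml_backend_synchronize(backend[i]);

    // gather the partial results and concatenate them along the output dimension
    std::vector<float> result(n_out * n_tok);
    for (int i = 0; i < 2; ++i) {
        std::vector<float> part(n_out_half * n_tok);
        ggml_backend_tensor_get(y[i], part.data(), 0, ggml_nbytes(y[i]));
        for (int64_t t = 0; t < n_tok; ++t) { // per token, tile i fills rows [i*half, (i+1)*half)
            std::copy(part.begin() +  t      * n_out_half,
                      part.begin() + (t + 1) * n_out_half,
                      result.begin() + t * n_out + i * n_out_half);
        }
    }

    for (int i = 0; i < 2; ++i) {
        ggml_backend_buffer_free(buf[i]);
        ggml_free(ctx[i]);
        ggml_backend_free(backend[i]);
    }
    return 0;
}
```

Swapping the two `ggml_backend_cpu_init()` calls for `ggml_backend_rpc_init("host:port")` handles would be the natural next experiment; the structure stays the same, only the cost of the tensor transfers changes.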
I think that is definitely a nice optimization, but I don't think that is where the most juice for the squeeze lies.
If my benchmarks are correct, the biggest gain in performance comes from using multiple GPUs across multiple hosts in parallel, with optimized network infrastructure.
Without some budget you will never be able to run bigger models; my aim is to reduce that budget as much as possible.
I'll try optimizing for the RPC server with CUDA backends for now. You only lose a few percent even if you run CUDA behind RPC vs running it directly. Your implementation already rocks!
I have spent quite some time working with your codebase now. I implemented the split buffer logic for the RPC backend, only to hit a dead end at the compute step. It is not really feasible to take this further with different backend instances that more or less all contain the same logic. I would need a way to get really deep into the calculation to combine the results from the different mat mul ops, send them to the RPC backend, combine and propagate.
I think I now understand why you wanted me to implement it in a separate backend. Here is my take and how I'm planning to proceed; I'd love some feedback:
ggml-parallel backend:
- only ever has one device
- doesn't directly wrap other backends, but connects to multiple RPC workers of different kinds (I'll start with a CUDA worker)
- the workers expose functions for allocating buffers, but not for generic cgraph computation; instead they expose much more specific procedures for ops that are worth parallelizing, directly returning results (so the backend can combine the results and propagate them)
I fear this will lead to a lot of duplicated code between the worker and the CUDA backend implementation, for example, but I don't know how else I should start.
Maybe looking at the source code of distributed-llama might help. I use it on multiple nodes (CPU only) and for now it seems to be the fastest solution (by far) for my use case, so I think this is a good implementation of tensor parallelism.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I don't know if I am supposed to write this here, or even write this at all, but I am working on the RPC side of this project and, specifically right now, on this feature.
Just wanted to let you guys know.