Send/receive operators via MPI
Companion PR to https://github.com/ggerganov/llama.cpp/pull/2099. This PR adds support for distributed computation with new OP_SEND and OP_RECV operators for passing tensors between compute nodes via MPI.
The substantial changes to this repository are:
- New `GGML_MPI` compile-time option
- Replace `char padding[4]` with `int tag` in the `ggml_tensor` struct; this is used to indicate the source / destination nodes for passing tensors (while preserving the overall struct size)
- New send and receive graph operators
If `GGML_MPI` is not enabled at compile time, the send and receive operators fail with assertion errors at graph-computation time. It is the responsibility of the caller to construct a correct partial computation graph on each MPI node.
Currently only float32 tensors are supported. This is sufficient for llama.cpp, but support for other types can be added later. While GGML does not appear to support big-endian architectures, I'd like to keep the messages endian- and architecture-agnostic by using MPI's native data types where possible.
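As a minimal sketch of the endian-agnostic messaging idea, the example below exchanges a float32 buffer between two ranks using MPI's predefined `MPI_FLOAT` datatype, so MPI (not the application) owns the wire representation of the values. This is plain MPI rather than the new ggml operators, and the tag value and buffer size are illustrative; it needs an MPI toolchain (`mpicc`, `mpirun -np 2`) to run.

```c
#include <mpi.h>
#include <stdio.h>

/* Illustrative only: send a small float32 buffer from rank 0 to rank 1
 * using MPI_FLOAT, letting MPI handle any representation conversion. */
int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float buf[4] = {0};
    const int tag = 7;  /* would come from ggml_tensor's tag field in this PR's scheme */

    if (rank == 0) {
        for (int i = 0; i < 4; i++) buf[i] = (float) i;
        /* count is in elements, not bytes, because the datatype is MPI_FLOAT */
        MPI_Send(buf, 4, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received: %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```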
I tested these changes thoroughly in the above-linked PR; in this repo I have only verified that the project builds.
I think there is some overlap between this and the plan to implement mixed CPU/GPU evaluation in llama.cpp by splitting the graph into multiple parts and running each of them in a different compute backend. MPI could be supported with the same system by creating an MPI compute backend for every node in the cluster and assigning a part of the graph to each one. It may be better to work on creating one common interface that can support both cases.
I suspect that "automatic distribution" across multiple hosts will be difficult to implement... Besides the graph serialization/deserialization headaches, the scheduler would need to reason about the appropriate cut-points based on tensor sizes in order to maximize data locality and minimize network communication. IMHO the developer is in a much better position to optimize the graph for memory usage and communication overhead across the cluster. If you look at the linked PR, I was able to distribute llama.cpp using send/receive primitives with a fairly small amount of code; but this hides the amount of thinking that went into choosing appropriate places to cut the computation graph.
The idea is not to make the splits automatically; the programmer will still need to choose where to make these splits, and the user will need to specify which backend to use for each split.
It would definitely be more complicated since it is a more general approach, but it would be more flexible, it would simplify the code for the users, and it would avoid code duplication. Your approach also depends on mmap to load only the required parts of the models on each node, but that's not going to work with GPUs.
OK, I think you have a much clearer idea of what a unified API would look like than I do. Would be great to include MPI in that discussion.
My goal for now is to implement this for CUDA in llama.cpp, and to show how this could be done to split computation between the CPU and CUDA in a clean way that can be extended to other backends. I would like to implement that first as a proof of concept, and after that we can work on defining a common interface to all the compute backends, including CPU, CUDA, Vulkan, Metal, OpenCL, and MPI. That may take a few weeks at least, though.
Closing in favor of approach described in https://github.com/ggerganov/llama.cpp/pull/2099#issuecomment-1624260680