
[WIP] RPC split row

Open · LeaveNhA opened this issue 3 months ago · 6 comments

The PR

This work has a single goal: enabling row-splitting mode on RPC clusters.

TL;DR

I got bored and wanted to contribute and join you all on this beautiful journey. I hope you'll welcome me.

Details & Background

This is a heavy-WIP situation, including this description. I will keep working on this PR and make sure it fits well with the rest of the project.

For the background:

Metal devices have only one GPU. This is a bit tricky, because row splitting has no use on a single device/backend. But the ultimate goal is to have it so that, with RPC, multiple devices can run inference faster and more efficiently. For this, I worked on both sides: I implemented a very, very early stage of row-wise splitting mode in the Metal backend and then made it work with RPC too.
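To make the idea concrete, here is a minimal, self-contained C++ sketch of what row-wise splitting means for a single matrix-vector product (purely illustrative, not llama.cpp/ggml code): each device owns a contiguous slice of the weight matrix's rows, computes its partial output independently, and the partial outputs are then concatenated.

```cpp
// Illustrative sketch of row-wise tensor splitting (not llama.cpp/ggml code).
// Computes y = W * x with the rows of W partitioned across "devices",
// simulated here with threads.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int n_rows = 8, n_cols = 4, n_devices = 2;
    std::vector<float> W(n_rows * n_cols), x(n_cols, 1.0f), y(n_rows, 0.0f);
    for (int i = 0; i < n_rows * n_cols; ++i) W[i] = (float) i;

    // Each "device" computes only the output rows it owns; no communication
    // is needed until the partial results are gathered into y.
    const int rows_per_dev = n_rows / n_devices;
    std::vector<std::thread> devices;
    for (int d = 0; d < n_devices; ++d) {
        devices.emplace_back([&, d] {
            const int row0 = d * rows_per_dev;
            for (int i = row0; i < row0 + rows_per_dev; ++i) {
                float acc = 0.0f;
                for (int j = 0; j < n_cols; ++j) {
                    acc += W[i * n_cols + j] * x[j];
                }
                y[i] = acc; // disjoint output slice per device, so no race
            }
        });
    }
    for (auto & t : devices) { t.join(); }

    for (int i = 0; i < n_rows; ++i) { printf("y[%d] = %.1f\n", i, y[i]); }
    return 0;
}
```

With RPC, the "devices" become remote backends, so that gather step crosses the network for every split matmul, which is plausibly one source of the slowdown described below.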

The current PR has the implementation, but be aware: the performance is unacceptable, and it gets worse with every device you add to the cluster. I will inspect the PR and read whatever sources I can find to build the domain knowledge I need to solve this.

Tests & Results:

❯ ./build-rpc-split-mode-row-release/bin/llama-bench -m ../llama.cpp.org.new.rpc/hfmodels/models/llama-2-7b.Q4_0.gguf --split-mode row --rpc 127.0.0.1:50052
| model                          |       size |     params | backend    | threads |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ----: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Metal,BLAS,RPC |       8 |   row |           pp512 |         54.29 ± 6.55 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Metal,BLAS,RPC |       8 |   row |           tg128 |          0.64 ± 0.01 |

build: 997e3047 (6444)
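For context, the --rpc flag above assumes an rpc-server worker already listening on that address; on a single machine one would typically start it first, roughly like this (the -p port flag follows the upstream RPC example, and the exact build path here is just illustrative):

❯ ./build-rpc-split-mode-row-release/bin/rpc-server -p 50052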

In any case, every comment, suggestion, and bit of support is very welcome.

LeaveNhA · Sep 16 '25

Is anybody still working on a backend-agnostic row splitting implementation?

jeffbolznv · Sep 16 '25

> Is anybody still working on a backend-agnostic row splitting implementation?

I don't think anyone is working on this atm. Will tag @slaren and @JohannesGaessler in case they are aware of any ongoing efforts.

ggerganov · Sep 16 '25

@koush had an initial implementation (https://github.com/ggml-org/llama.cpp/pull/13818#issuecomment-2927411762), but I am not sure if that's still being worked on.

slaren · Sep 16 '25

My current priority that is not specific to CUDA is automating how tensors are distributed to GPUs (by reusing the code from https://github.com/ggml-org/llama.cpp/pull/15860); after that I intend to get back to working on backend-agnostic tensor parallelism.

In parallel I'm refactoring and deduplicating the FlashAttention CUDA code and optimizing it for AMD. Since I've already invested the effort to read the AMD ISA documentation, I'll probably buy an RDNA4 GPU and implement better support for the AMD equivalent of tensor cores.

JohannesGaessler · Sep 16 '25

> Is anybody still working on a backend-agnostic row splitting implementation?

A backend-agnostic approach would be much more valuable in the big picture, if you ask me.

On the other hand, if I can get in touch with @koush and sync up on the current situation, I would gladly get on board with another PR to make this feature work both standalone and in cluster mode.

LeaveNhA · Sep 17 '25

In hopes of helping: a project called Petals (MIT license) already did this for some models:

https://github.com/bigscience-workshop/petals/tree/main/src/petals/models/llama

Perhaps it could serve as inspiration for a more model-agnostic RPC equivalent.

bennmann · Nov 10 '25