
Feature Request: RPC offloading using a local model copy

Open Abdulhanan535 opened this issue 1 year ago • 9 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Loading the model over the network via RPC takes a lot of time, which makes it almost unusable. I think the best approach would be to have the whole model downloaded on both sides, and then only load the specific layers the network tells you to load. Loading just those layer indices would speed up loading and save time on every launch.

Motivation

Loading the model over the network via RPC takes a lot of time.

Possible Implementation

Download the whole model on both sides and then only load the specific layers the network tells you to load.

Abdulhanan535 avatar Oct 30 '24 13:10 Abdulhanan535

I would love to see some RPC enhancement

GoudaCouda avatar Oct 30 '24 20:10 GoudaCouda

The network handling is also very slow. I rarely see speeds above 1 Gbps on a direct 10G link between two servers, and throughput fluctuates wildly (with fast NVMe SSDs and no other load on the servers).

Zorg33 avatar Nov 09 '24 16:11 Zorg33

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Dec 25 '24 01:12 github-actions[bot]

I too would like to see this feature, or at the very least a cache feature where the entire model is cached and kept until it is changed out, so the load over the network only has to happen on first use. I believe dllama uses an approach like this, and their claim is that using the network for inference updates rather than model loading increases throughput significantly, but I'm having issues setting up the dllama cluster properly :/. This really should be higher on the list.

piisawheel avatar Jan 06 '25 00:01 piisawheel

this would be great

suspicious-pineapple avatar Jan 11 '25 00:01 suspicious-pineapple

This has been one of the most requested features for the RPC backend, so I will start working on it.

@slaren already gave some ideas here like hashing the payload of set_tensor and using local copies on the server side.

I propose the following implementation for this:

  1. Allow RPC servers to advertise certain capabilities to their clients. This can be enabled with CLI arguments like this:
./rpc-server --enable-cache
  2. Clients send a handshake message after connecting to the server and retrieve the server version and capabilities. If the server supports caching tensor data, we first send a hash, and if we receive a cache miss we send the tensor data.

rgerganov avatar Feb 04 '25 16:02 rgerganov
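
A minimal sketch of this handshake and hash-first transfer from the client's side. The message, capability, and helper names below (rpc_hello, RPC_CAP_CACHE, the transport stubs) are hypothetical and are not the actual ggml-rpc protocol:

```cpp
#include <cstddef>
#include <cstdint>

struct rpc_hello {
    uint32_t version;       // server protocol version
    uint32_t capabilities;  // bitmask of optional features
};

constexpr uint32_t RPC_CAP_CACHE = 1u << 0;  // server can cache tensor payloads

// placeholder transport helpers -- a real client would read/write the socket
rpc_hello recv_hello(int /*sock*/)                             { return {1, RPC_CAP_CACHE}; }
bool      send_hash_check(int /*sock*/, uint64_t /*hash*/)     { return false; } // true = cache hit
void      send_tensor_data(int /*sock*/, const void *, size_t) {}

void set_tensor(int sock, const void * data, size_t size, uint64_t hash) {
    const rpc_hello hello = recv_hello(sock);
    if (hello.capabilities & RPC_CAP_CACHE) {
        if (send_hash_check(sock, hash)) {
            return;  // server already has this payload cached
        }
    }
    send_tensor_data(sock, data, size);  // cache miss, or caching not supported
}
```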

The server could optionally use mmap when loading the cached data.

ggerganov avatar Feb 04 '25 16:02 ggerganov

I have been taking a look at this and wondering if we could create a new RPC_CMD_LOAD_TENSOR command which passes in the model path/name (or a hash of the model) and the tensor/weights to load. We may have to assume that the model file exists in the same location on both the client and the server. If we only load the tensors the client requests, this should scale across GPUs of differing sizes.

Another load optimisation would be to parallelise loading across the GPU backends. It seems that llama_model_loader::load_all_data() loads the model sequentially, one GPU after the next.

lingster avatar Feb 09 '25 21:02 lingster
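
For illustration, one possible shape for such a request. The struct and field names are hypothetical; the underlying assumption is that the client identifies a tensor by name within a model both sides already have on disk:

```cpp
#include <cstdint>
#include <string>

struct rpc_load_tensor_request {
    uint64_t    model_hash;   // or a path/name identifying the GGUF file
    std::string tensor_name;  // e.g. "blk.17.attn_q.weight"
};
// The server resolves tensor_name inside its local copy of the model and
// loads the data itself, so the weights never cross the network.
```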

+1

D-i-t-gh avatar Feb 25 '25 19:02 D-i-t-gh

I was testing the RPC feature just now and I was considering making a cache tensor feature if it wasn't already in the works. I was thinking about the same as @slaren.

I would add a new command for loading a tensor with a given hash. If there's no answer, assume the server doesn't support it; we can detect the missing answer by sending another command afterwards (like get alignment) and waiting for its reply. If there is an answer, it tells us whether the tensor is cached or not. It might be worth splitting this into two commands (check and load) if that makes it easier to load cached tensors and upload uncached ones at the same time.

Awwtifishal avatar Mar 15 '25 01:03 Awwtifishal

I wrote the client part in PR #12496. I am using xxhash, which should be good enough for our purposes.

For the server part I intend to add a new CLI param which specifies a GGUF file. The server will hash all of its tensors, and if the client sends a known hash, it will load the data from the GGUF file.

rgerganov avatar Mar 21 '25 11:03 rgerganov
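
A rough sketch of that server-side indexing step. XXH64 is the real xxhash entry point, but the surrounding map layout is an assumption and the GGUF iteration itself is elided:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

#include <xxhash.h>

// hash -> tensor name, built once at startup from the GGUF given on the CLI
using tensor_index = std::unordered_map<uint64_t, std::string>;

void index_tensor(tensor_index & idx, const std::string & name,
                  const void * data, size_t size) {
    const uint64_t h = XXH64(data, size, 0);  // same hash the client sends
    idx[h] = name;  // on a known hash, stream the tensor back out of the GGUF
}
```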

I did some tests with gemma3-1b-it-f16 (2GB) and two remote hosts (A and B). The latency (measured with ping) to host A is 35ms and to host B is 170ms.

  • load time with rpc-server running on host A (master): 47 sec
  • load time with rpc-server running on host A (PR): 16 sec
  • load time with rpc-server running on host B (master): 283 sec
  • load time with rpc-server running on host B (PR): 65 sec

I think for best results the rpc-server should be running in a local, low-latency network.

rgerganov avatar Mar 24 '25 12:03 rgerganov

The current server-side implementation has some shortcomings:

  1. it loads and keeps the entire GGUF in memory, which is not really needed; we should compute the hashes at the beginning and maybe keep a hash->tensor_name mapping
  2. it currently supports caching only one GGUF file which may not be enough for split models

An alternative implementation for the server would be to auto-cache large tensors in some local directory, using the XXH64 hash as the file name (16 chars). On startup the rpc-server will traverse all the files in the cache directory and create a map of the known hashes. At runtime it will automatically add new files for large RPC_CMD_SET_TENSOR commands. This way, when some new model is used with the server, only the first run will be slow. We can also experiment with backend functions which use mmap for loading tensors, calling those functions through ggml_backend_reg_get_proc_address() in the server.

Any thoughts?

rgerganov avatar Mar 24 '25 12:03 rgerganov
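
A sketch of how such a cache directory could be scanned and how the 16-character file names could be derived. The helper names and layout are assumptions, not the actual rpc-server code:

```cpp
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <string>
#include <unordered_set>

namespace fs = std::filesystem;

// XXH64 value rendered as a fixed-width 16-char hex file name
std::string hash_to_name(uint64_t h) {
    char buf[17];
    std::snprintf(buf, sizeof(buf), "%016llx", (unsigned long long) h);
    return buf;
}

// scan the cache dir once at startup to learn which hashes are already present
std::unordered_set<std::string> scan_cache(const fs::path & dir) {
    std::unordered_set<std::string> known;
    if (fs::exists(dir)) {
        for (const auto & entry : fs::directory_iterator(dir)) {
            known.insert(entry.path().filename().string());
        }
    }
    return known;
}
```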

An alternative implementation for the server ...

I think we can have both implementations work together. If one or more GGUF files are passed, load them at start and populate a map of hashes as currently done in the PR. But in addition to this logic, when a hash is not found, we can check for a local file with that hash (note that there is no need to traverse the folder in advance). If found, use it (and remember it in memory); if not found, create it using the data from the client. This way, we can even start the server without GGUFs and it will create its own hashes after the first execution, which might be more convenient for some use cases.

Btw, instead of bringing in the XXH64 source, can we simply use std::hash? If speed is a concern, we can hash just a small part of the tensor data so that it is fast enough.

ggerganov avatar Mar 24 '25 12:03 ggerganov
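
The fallback order described above, as a small self-contained sketch; the three helpers are stubs standing in for the real server-side checks:

```cpp
#include <cstdint>

// stand-ins for the real lookups
bool in_gguf_map(uint64_t)  { return false; }  // hash belongs to a GGUF passed at startup
bool in_cache_dir(uint64_t) { return false; }  // a cache file named after the hash exists
void remember(uint64_t)     {}                 // keep the hash in memory for later requests

// returns true if the server can supply the tensor data locally
bool have_tensor_locally(uint64_t hash) {
    if (in_gguf_map(hash))  return true;
    if (in_cache_dir(hash)) { remember(hash); return true; }
    return false;  // the client sends the data and the server writes a new cache file
}
```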

Btw, instead of bringing the XXH64 source, can we simply use std::hash?

AFAIK, the C++ standard doesn't specify the algorithm used for std::hash, which means we may get different results with different compilers/toolchains.

rgerganov avatar Mar 25 '25 08:03 rgerganov

A basic FNV hash function should be good enough, and it's simple to implement.

ggerganov avatar Mar 25 '25 09:03 ggerganov
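
For reference, a minimal 64-bit FNV-1a along those lines (not necessarily what ended up in the tree):

```cpp
#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a: xor each byte into the state, then multiply by the FNV prime
uint64_t fnv1a_64(const uint8_t * data, size_t size) {
    uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (size_t i = 0; i < size; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;           // FNV prime
    }
    return h;
}
```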

This should be fixed now. Use the -c command line option when starting rpc-server to enable the local cache.

rgerganov avatar Mar 28 '25 07:03 rgerganov

This should be fixed now. Use the -c command line option when starting rpc-server to enable the local cache.

Thanks, I'm going to give it a try. I'm curious: why does this need to be a single GGUF file, with no support for split models? Would it be possible to run one rpc-server on a host instead of running a separate copy for every GPU?

segmond avatar Apr 17 '25 23:04 segmond