llama.cpp

Bug: rpc-server --mem Doesn't Match backend memory

oldgithubman opened this issue 1 year ago • 9 comments

What happened?

$ CUDA_VISIBLE_DEVICES=0 build/bin/Release/rpc-server -p 50052 --mem 10000
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 1808 MB
$ CUDA_VISIBLE_DEVICES=0 build/bin/Release/rpc-server -p 50052 --mem 20000
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 3616 MB
$ CUDA_VISIBLE_DEVICES=0 build/bin/Release/rpc-server -p 50052 --mem 30000
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 1328 MB

I expected backend memory: $mem MB when I input --mem $mem
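
Interestingly, the printed values line up exactly with the requested sizes reduced modulo 2^32 after conversion to bytes, which suggests (though I haven't confirmed it in the code) that the megabyte count is multiplied by 1024*1024 in a 32-bit integer somewhere in the MSVC build (where unsigned long is 32 bits). The arithmetic can be checked with a plain bash loop:

$ for mb in 10000 20000 30000; do echo "--mem $mb -> $(( (mb * 1024 * 1024) % 4294967296 / (1024 * 1024) )) MB"; done
--mem 10000 -> 1808 MB
--mem 20000 -> 3616 MB
--mem 30000 -> 1328 MB

Those are exactly the 1808 MB, 3616 MB, and 1328 MB figures printed above.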

Name and Version

$ ./build/bin/Release/llama-cli --version
version: 3368 (dd07a123)
built with MSVC 19.40.33812.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

(Same commands and output as shown in the description above.)

oldgithubman • Jul 10 '24

This should actually be high severity

oldgithubman • Jul 10 '24

Memory limits (rpc-server --mem) are not working!

myan-o • Jul 10 '24

Memory limits (rpc-server --mem) are not working!

? I know? That's what I'm saying?

oldgithubman • Jul 11 '24

There is a problem where all memory is used even if --mem is specified.

myan-o • Jul 11 '24

There is a problem where all memory is used even if --mem is specified.

Awesome. /s Thanks for telling me though

oldgithubman • Jul 11 '24

It loads only the number of layers set with --ngl, so it crashes due to a buffer overflow.

myan-o • Jul 11 '24

Ideally, the specification would be changed so that -ngl can be set individually on the RPC server side.

myan-o • Jul 11 '24

Ideally, the specification would be changed so that -ngl can be set individually on the RPC server side.

I think fixing --mem would be better. Remote servers should be as hands-off as possible, and -ngl should ideally become a --mem-style option as well; that would make far more sense than -ngl.
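
For context, the current division of labour is roughly that the server only advertises a memory budget via --mem, while the client decides layer placement with -ngl and a list of servers passed via --rpc. A minimal sketch of that setup (the address and model path are placeholders):

# on the machine with the GPU: expose it over RPC and advertise a 10000 MB budget
$ CUDA_VISIBLE_DEVICES=0 build/bin/Release/rpc-server -p 50052 --mem 10000

# on the client: offload layers to the RPC backend
$ build/bin/Release/llama-cli -m model.gguf --rpc 192.168.1.100:50052 -ngl 99

So today the only knob the server itself exposes is --mem, which is why it matters that it works.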

oldgithubman • Jul 11 '24

q.v.

I also found the way the RPC server and client deal with specifying / limiting memory on CPU / GPU resources confusing and limited, so I, too, would like to see a simple, clear means of limiting how much memory (RAM/VRAM) is used on each node. IMO it would also be nicer if the model data could be loaded locally rather than uploaded over the network to the RPC servers.

#8112 Bug: [RPC] RPC apparently isn't honoring backend memory capacity et. al.

#8113 Feature Request: Provide means to quantify the restriction of RAM/VRAM usage for each GPU and system RAM.

#8114 Feature Request: It would be convenient and faster if users could specify that the model data used for a RPC-server instance is already available by some fast(er) means (file system GGUF, whatever).

ghchris2021 • Jul 19 '24

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] • Sep 2 '24

This still happens; it should be reopened.

Awwtifishal • Mar 15 '25