llama.cpp
ggml : add RPC backend
This PR transfers the work started in ggml PR 761 here. It adds an RPC backend which proxies all backend operations to a remote server running a regular backend (CPU, CUDA, Metal, etc.). The general idea is to allow distributed LLM inference across multiple hosts with different kinds of hardware.
This is a sample run which splits the layers of a 7B F16 model across two servers, allocating 7 GiB on the first and 6.5 GiB on the second:
...
llm_load_tensors: ggml ctx size = 0,44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 296,88 MiB
llm_load_tensors: RPC buffer size = 7072,53 MiB
llm_load_tensors: RPC buffer size = 6537,36 MiB
.................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: RPC KV buffer size = 34,00 MiB
llama_kv_cache_init: RPC KV buffer size = 30,00 MiB
llama_new_context_with_model: KV self size = 64,00 MiB, K (f16): 32,00 MiB, V (f16): 32,00 MiB
llama_new_context_with_model: CPU output buffer size = 0,14 MiB
llama_new_context_with_model: RPC0 compute buffer size = 73,00 MiB
llama_new_context_with_model: RPC1 compute buffer size = 82,22 MiB
llama_new_context_with_model: CPU compute buffer size = 9,01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 3
...
Current limitations:
- Quantized models are not supported
- Pipeline parallelism is not currently supported
- Server endpoints are hardcoded in ggml-rpc.cpp
Building:
- Install gRPC by following this guide and using -DCMAKE_CXX_STANDARD=14
- Build the main example with cmake -DLLAMA_RPC=ON -DCMAKE_PREFIX_PATH=$MY_INSTALL_DIR ..
- Build rpc-server in a separate dir, adding the flag for the corresponding backend, e.g. cmake -DLLAMA_RPC=ON -DLLAMA_CUDA=ON -DCMAKE_PREFIX_PATH=$MY_INSTALL_DIR ..
@rgerganov Nice to meet you :D
In theory, could this PR allow GPU inference across different APIs? I have a P40 and a 7900 XTX. Could they work together, each using its own API?
In theory, could this PR allow GPU inference across different APIs?
Yes, you can use different backend implementations, running on different machines. Build an rpc-server for each configuration and run them in the same local network. The main example should be configured with the IP:port of each rpc-server and it should be able to offload model layers to them.
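For illustration, the endpoints are passed as a comma-separated list of IP:port entries. A minimal sketch of how such a list could be split into (host, port) pairs (hypothetical helper, not llama.cpp code):

```python
# Hypothetical helper, not part of llama.cpp: split a comma-separated
# endpoint list such as "192.168.88.2:50052,192.168.88.100:50052"
# into (host, port) pairs.
def parse_rpc_servers(arg: str) -> list[tuple[str, int]]:
    endpoints = []
    for entry in arg.split(","):
        host, _, port = entry.rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints

print(parse_rpc_servers("192.168.88.2:50052,192.168.88.100:50052"))
# → [('192.168.88.2', 50052), ('192.168.88.100', 50052)]
```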
Would be useful to add a CI workflow that builds the RPC backend. No need to run tests for now - just make sure the build succeeds
I tried to implement this without gRPC, using only socket API: https://github.com/rgerganov/llama.cpp/tree/socket-rpc
Unfortunately, this implementation performs much worse compared to the gRPC one. When I am running rpc-server on localhost, I get 25t/s with gRPC and 15t/s with my custom socket RPC, using the same model. I don't think my serialization is much worse compared to protobuf, so I guess I am doing the networking part wrong.
I don't like adding gRPC as build time dependency but it looks like it is not trivial to implement this from scratch even for simple synchronous APIs ...
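For context, a from-scratch socket RPC typically relies on some simple message framing so the receiver knows where each request ends. A minimal sketch of one such scheme (an assumed wire format for illustration, not necessarily the one used in this branch):

```python
import struct

# Illustrative framing only; the actual wire format in this PR may differ.
# Each message is a 1-byte command id and a 4-byte little-endian payload
# length, followed by the raw payload bytes.
HEADER = struct.Struct("<BI")

def pack_message(cmd: int, payload: bytes) -> bytes:
    return HEADER.pack(cmd, len(payload)) + payload

def unpack_message(data: bytes) -> tuple[int, bytes]:
    cmd, length = HEADER.unpack_from(data)
    return cmd, data[HEADER.size:HEADER.size + length]
```

Serialization like this is cheap; as the discussion below shows, the throughput gap turned out to be in the networking layer, not the encoding.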
Unfortunately, this implementation performs much worse compared to the gRPC one.
Long shot, but does it help if you disable Nagle's algorithm for the socket: https://stackoverflow.com/a/17843292/4039976
Long shot, but does it help if you disable Nagle's algorithm for the socket
Spot on! Setting TCP_NODELAY is a game changer:
CUDA backend: 48 t/s
RPC backend with gRPC: 25 t/s
RPC backend with socket-rpc: 15 t/s
RPC backend with socket-rpc and setting TCP_NODELAY: 43 t/s
gRPC also sets this by default.
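Disabling Nagle's algorithm is a one-line setsockopt; in Python it looks like this (minimal sketch):

```python
import socket

# Nagle's algorithm batches small writes before sending, which adds
# latency to request/response RPC traffic; TCP_NODELAY disables it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)  # → True
sock.close()
```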
I will continue working with my custom socket RPC in this PR. The previous gRPC implementation is still available at https://github.com/rgerganov/llama.cpp/tree/grpc
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 539 iterations 🚀
- Concurrent users: 8, duration: 10m
- HTTP request : avg=8690.33ms p(95)=20965.24ms fails=, finish reason: stop=485 truncated=54
- Prompt processing (pp): avg=98.38tk/s p(95)=362.75tk/s
- Token generation (tg): avg=45.6tk/s p(95)=45.72tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=rpc commit=1519cb4582db5966656b889dda419baead501c31
[Benchmark charts omitted: four xychart panels titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 539 iterations", plotting llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio and llamacpp:requests_processing over time.]
llama_max_devices should be updated to return some value higher than 1 when building with RPC. We should probably remove this function, or make it always return the same value, but for now for consistency it needs to return the maximum number of devices, since llama_model_params::tensor_split is documented to have size llama_max_devices.
It now returns GGML_RPC_MAX_SERVERS, which is set to 16.
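As an aside, tensor_split holds one proportion per device (its documented size is llama_max_devices()). A toy sketch of how such proportions could map layers to devices (illustrative only, not the actual llama.cpp splitting code; the function name is made up):

```python
# Toy illustration: distribute n_layers across devices in proportion
# to tensor_split, handing rounding remainders to the largest shares.
def split_layers(n_layers: int, tensor_split: list[float]) -> list[int]:
    total = sum(tensor_split)
    counts = [int(n_layers * s / total) for s in tensor_split]
    remainder = n_layers - sum(counts)
    by_share = sorted(range(len(counts)), key=lambda i: tensor_split[i], reverse=True)
    for i in by_share[:remainder]:
        counts[i] += 1
    return counts

print(split_layers(33, [7.0, 6.5]))  # → [18, 15]
```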
Thanks for the reviews. I will continue working on this next week. I need to address a couple of TODOs, add Windows support, fix some resource leaks and add a README.
I did some performance tests with 2 hosts with NVIDIA GPUs and 3 different models.
Testbed
Host A (IP 192.168.88.100): Dell Precision 5560, 16-core i7 @ 2.5 GHz, NVIDIA T1200 Laptop GPU, 4 GB VRAM
Host B (IP 192.168.88.2): AMD Ryzen 9, 24 cores, NVIDIA GeForce GTX 1660, 6 GB VRAM
Both hosts are running Linux and connected on a local gigabit network. In all tests below I am running main on Host A.
tinyllama-1.1b F16 (size: 2,05 GiB)
This is a small model which fits in the VRAM of Host A. I am using it to compare the performance of local and remote servers.
CUDA backend
$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99 -fa
Result: 63 t/s
RPC with local server running CUDA backend
$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.100:50052 -ngl 99 -fa
Result: 60 t/s
RPC with remote server running CUDA backend
$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.2:50052 -ngl 99 -fa
Result: 42 t/s
BgGPT-7B-Instruct-v0.2.Q8_0 (size: 7,22 GiB)
This model doesn't fit entirely in the VRAM of either Host A or Host B.
CUDA backend
We can offload 12 layers to GPU:
$ bin/main -m ../models/BgGPT-7B-Instruct-v0.2.Q8_0.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 12 -fa
Result: 7.25 t/s
RPC with local and remote server running CUDA backend
$ bin/main -m ../models/BgGPT-7B-Instruct-v0.2.Q8_0.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.2:50052,192.168.88.100:50052 -ngl 99 -fa
...
Connecting to 192.168.88.2:50052
Connecting to 192.168.88.100:50052
llm_load_tensors: ggml ctx size = 0,44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 157,71 MiB
llm_load_tensors: RPC buffer size = 4641,66 MiB
llm_load_tensors: RPC buffer size = 2589,07 MiB
...
Result: 14.6 t/s
mistral-7b-instruct-v0.2.Q4_K_M (size: 4,07 GiB)
This model doesn't fit in Host A, but it fits in Host B.
CUDA backend
We can offload 23 layers to GPU:
$ bin/main -m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 23 -fa
Result: 18 t/s
RPC with remote server running CUDA backend
$ bin/main -m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.2:50052 -ngl 99 -fa
Result: 25 t/s
RPC with local and remote server running CUDA backend
$ bin/main -m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.2:50052,192.168.88.100:50052 -ngl 99 -fa
Result: 22 t/s
With more than one RPC server, we can improve performance with a better implementation of cpy_tensor. Currently there is no way to copy tensors directly between RPC servers, so they are downloaded to and re-uploaded from the host where main is running. To fix this, RPC servers should be able to talk to each other.
would this also indirectly allow multiple backends on the same machine by running multiple instances of the rpc-server on different ports? e.g. cuda and rocm for a single machine with nvidia and amd gpus?
would this also indirectly allow multiple backends on the same machine by running multiple instances of the rpc-server on different ports? e.g. cuda and rocm for a single machine with nvidia and amd gpus?
Yes. I should clarify this in the README.
In the future, it would be good to be able to build with a local backend in addition of the RPC backend, so that a RPC server is not necessary to use the local GPU.
Agree but we should select the backend in runtime instead of compile-time for this to work. I will be working on this next.
in theory, Could this PR allow GPU inference across different API?
Yes, you can use different backend implementations, running on different machines. Build an rpc-server for each configuration and run them in the same local network. The main example should be configured with the IP:port of each rpc-server and it should be able to offload model layers to them.
really like this! this is basically the same concepts LocalAI is built on, as we have a grpc server on top of llama.cpp and other backends as well - really happy to see this in llama.cpp!
don't stop this candy