
ggml : add RPC backend

Open · rgerganov opened this pull request 1 year ago · 3 comments

This PR transfers the work started in ggml PR 761 here. It adds an RPC backend which proxies all backend operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). The general idea is to allow distributed LLM inference using multiple hosts running on different kinds of hardware.
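To make the proxying idea concrete, here is a minimal sketch of the pattern: the client frames a backend operation as a message, ships it over a TCP socket, and a server process acknowledges it. The command id and wire format below are purely illustrative, not the actual ggml-rpc protocol.

```python
import socket
import struct
import threading

# Hypothetical command id and framing -- NOT the actual ggml-rpc protocol.
# Frame: 1-byte command, 4-byte little-endian payload length, then the payload.
CMD_SET_TENSOR = 1

def send_msg(sock, cmd, payload):
    sock.sendall(struct.pack("<BI", cmd, len(payload)) + payload)

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf += chunk
    return buf

def recv_msg(sock):
    cmd, length = struct.unpack("<BI", recv_exact(sock, 5))
    return cmd, recv_exact(sock, length)

def server(listener):
    # A real rpc-server would dispatch the command to a local backend
    # (CPU, CUDA, Metal, ...); here we just acknowledge the payload size.
    conn, _ = listener.accept()
    cmd, payload = recv_msg(conn)
    send_msg(conn, cmd, struct.pack("<I", len(payload)))
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
t = threading.Thread(target=server, args=(listener,))
t.start()

client = socket.create_connection(listener.getsockname())
send_msg(client, CMD_SET_TENSOR, b"\x00" * 1024)  # "upload" 1 KiB of tensor data
cmd, ack = recv_msg(client)
(acked_bytes,) = struct.unpack("<I", ack)
print(cmd, acked_bytes)  # 1 1024
t.join()
client.close()
listener.close()
```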

This is a sample run which splits the layers of a 7B F16 model across two servers, allocating 7 GiB on the first and 6.5 GiB on the second:

...
llm_load_tensors: ggml ctx size =    0,44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   296,88 MiB
llm_load_tensors:        RPC buffer size =  7072,53 MiB
llm_load_tensors:        RPC buffer size =  6537,36 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        RPC KV buffer size =    34,00 MiB
llama_kv_cache_init:        RPC KV buffer size =    30,00 MiB
llama_new_context_with_model: KV self size  =   64,00 MiB, K (f16):   32,00 MiB, V (f16):   32,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,14 MiB
llama_new_context_with_model:       RPC0 compute buffer size =    73,00 MiB
llama_new_context_with_model:       RPC1 compute buffer size =    82,22 MiB
llama_new_context_with_model:        CPU compute buffer size =     9,01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3
...
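The 7 GiB / 6.5 GiB split above hints at how a frontend might turn per-server memory proportions into per-server layer counts. The following is a hypothetical sketch of such a proportional split, not the actual llama.cpp splitting logic:

```python
# Hypothetical sketch of turning per-server memory proportions into
# per-server layer counts; NOT the actual llama.cpp splitting logic.
def split_layers(n_layers, proportions):
    total = sum(proportions)
    counts = [int(n_layers * p / total) for p in proportions]
    # hand the rounding remainder to the servers with the largest shares
    remainder = n_layers - sum(counts)
    by_share = sorted(range(len(counts)), key=lambda i: proportions[i], reverse=True)
    for i in by_share[:remainder]:
        counts[i] += 1
    return counts

# the 7 GiB / 6.5 GiB split from the run above, over 32 repeating layers:
print(split_layers(32, [7.0, 6.5]))  # [17, 15]
```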

Current limitations:

  • Quantized models are not supported
  • Pipeline parallelism is not currently supported
  • Server endpoints are hardcoded in ggml-rpc.cpp

Building:

  1. Install gRPC by following this guide and using -DCMAKE_CXX_STANDARD=14
  2. Build the main example with cmake -DLLAMA_RPC=ON -DCMAKE_PREFIX_PATH=$MY_INSTALL_DIR ..
  3. Build rpc-server in a separate dir, adding the flag for the corresponding backend, e.g. cmake -DLLAMA_RPC=ON -DLLAMA_CUDA=ON -DCMAKE_PREFIX_PATH=$MY_INSTALL_DIR ..

rgerganov avatar Apr 22 '24 14:04 rgerganov

@rgerganov Nice to meet you :D

phymbert avatar Apr 22 '24 14:04 phymbert

In theory, could this PR allow GPU inference across different APIs? I have a P40 and a 7900 XTX. Could they work together, each using its own API?

sorasoras avatar Apr 22 '24 16:04 sorasoras

In theory, could this PR allow GPU inference across different APIs?

Yes, you can use different backend implementations, running on different machines. Build an rpc-server for each configuration and run them in the same local network. The main example should be configured with the IP:port of each rpc-server and it should be able to offload model layers to them.

rgerganov avatar Apr 23 '24 08:04 rgerganov

Would be useful to add a CI workflow that builds the RPC backend. No need to run tests for now; just make sure the build succeeds.

ggerganov avatar Apr 25 '24 10:04 ggerganov

I tried to implement this without gRPC, using only the socket API: https://github.com/rgerganov/llama.cpp/tree/socket-rpc Unfortunately, this implementation performs much worse than the gRPC one. When running rpc-server on localhost, I get 25 t/s with gRPC and 15 t/s with my custom socket RPC, using the same model. I don't think my serialization is much worse than protobuf's, so I suspect I am doing the networking part wrong.

I don't like adding gRPC as a build-time dependency, but it looks like it is not trivial to implement this from scratch, even for simple synchronous APIs ...
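For illustration, a hand-rolled serializer for a tensor-like record is only a few lines; the record fields and framing below are made up for the sketch, not the actual ggml-rpc serialization, and no claim is made about how protobuf encodes the same data.

```python
import struct

# Made-up tensor record and framing, for illustration only -- not the
# actual ggml-rpc serialization format.
def serialize_tensor(name, dtype, shape, data):
    name_b = name.encode()
    out = struct.pack("<II", len(name_b), dtype) + name_b
    out += struct.pack("<I", len(shape)) + struct.pack(f"<{len(shape)}q", *shape)
    return out + struct.pack("<Q", len(data)) + data

def deserialize_tensor(buf):
    off = 0
    name_len, dtype = struct.unpack_from("<II", buf, off); off += 8
    name = buf[off:off + name_len].decode(); off += name_len
    (ndims,) = struct.unpack_from("<I", buf, off); off += 4
    shape = list(struct.unpack_from(f"<{ndims}q", buf, off)); off += 8 * ndims
    (n,) = struct.unpack_from("<Q", buf, off); off += 8
    return name, dtype, shape, buf[off:off + n]

msg = serialize_tensor("blk.0.attn_q.weight", 1, [4096, 4096], b"\x01\x02")
print(deserialize_tensor(msg))
```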

rgerganov avatar Apr 29 '24 07:04 rgerganov

Unfortunately, this implementation performs much worse compared to the gRPC one.

Long shot, but does it help if you disable Nagle's algorithm for the socket: https://stackoverflow.com/a/17843292/4039976

ggerganov avatar Apr 29 '24 09:04 ggerganov

Long shot, but does it help if you disable Nagle's algorithm for the socket

Spot on! Setting TCP_NODELAY is a game changer:

  • CUDA backend: 48 t/s
  • RPC backend with gRPC: 25 t/s
  • RPC backend with socket-rpc: 15 t/s
  • RPC backend with socket-rpc and TCP_NODELAY: 43 t/s

gRPC also sets this option by default.
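The fix itself is tiny. Nagle's algorithm batches small writes while waiting for ACKs, which adds latency to a chatty request/response protocol like this one; TCP_NODELAY turns it off. A generic sketch of enabling the option on any TCP socket (not llama.cpp code):

```python
import socket

# Disable Nagle's algorithm so each small RPC write goes out immediately
# instead of being buffered while waiting for the previous ACK.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay != 0)  # True
sock.close()
```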

rgerganov avatar Apr 29 '24 11:04 rgerganov

I will continue working with my custom socket RPC in this PR. The previous gRPC implementation is still available at https://github.com/rgerganov/llama.cpp/tree/grpc

rgerganov avatar Apr 29 '24 11:04 rgerganov

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 539 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8690.33ms p(95)=20965.24ms fails=, finish reason: stop=485 truncated=54
  • Prompt processing (pp): avg=98.38tk/s p(95)=362.75tk/s
  • Token generation (tg): avg=45.6tk/s p(95)=45.72tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=rpc commit=1519cb4582db5966656b889dda419baead501c31

[Charts omitted: mermaid xy-charts of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing over the 10-minute benchmark window]
github-actions[bot] avatar Apr 29 '24 12:04 github-actions[bot]

llama_max_devices should be updated to return some value higher than 1 when building with RPC. We should probably remove this function, or make it always return the same value, but for now, for consistency, it needs to return the maximum number of devices, since llama_model_params::tensor_split is documented to have size llama_max_devices.

slaren avatar Apr 30 '24 11:04 slaren

It now returns GGML_RPC_MAX_SERVERS, which is set to 16.

rgerganov avatar Apr 30 '24 11:04 rgerganov

Thanks for the reviews. I will continue working on this next week. I need to address a couple of TODOs, add Windows support, fix some resource leaks and add a README.

rgerganov avatar Apr 30 '24 12:04 rgerganov

I did some performance tests with 2 hosts with NVIDIA GPUs and 3 different models.

Testbed

Host A (IP 192.168.88.100): Dell Precision 5560, 16-core i7 @ 2.5 GHz, NVIDIA T1200 Laptop GPU, 4 GB VRAM
Host B (IP 192.168.88.2): AMD Ryzen 9, 24 cores, NVIDIA GeForce GTX 1660, 6 GB VRAM

Both hosts run Linux and are connected over a local gigabit network. In all tests below, main runs on Host A.

tinyllama-1.1b F16 (size: 2.05 GiB)

This is a small model which fits in the VRAM of Host A. I am using it to compare the performance of local and remote servers.

CUDA backend

$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 99 -fa

Result: 63 t/s

RPC with local server running CUDA backend

$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.100:50052 -ngl 99 -fa

Result: 60 t/s

RPC with remote server running CUDA backend

$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.2:50052 -ngl 99 -fa

Result: 42 t/s

BgGPT-7B-Instruct-v0.2.Q8_0 (size: 7.22 GiB)

This model doesn't fit entirely in the VRAM of either Host A or Host B.

CUDA backend

We can offload 12 layers to GPU:

$ bin/main -m ../models/BgGPT-7B-Instruct-v0.2.Q8_0.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 12 -fa

Result: 7.25 t/s

RPC with local and remote server running CUDA backend

$ bin/main -m ../models/BgGPT-7B-Instruct-v0.2.Q8_0.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.2:50052,192.168.88.100:50052 -ngl 99 -fa
...
Connecting to 192.168.88.2:50052
Connecting to 192.168.88.100:50052
llm_load_tensors: ggml ctx size =    0,44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   157,71 MiB
llm_load_tensors:        RPC buffer size =  4641,66 MiB
llm_load_tensors:        RPC buffer size =  2589,07 MiB
...

Result: 14.6 t/s

mistral-7b-instruct-v0.2.Q4_K_M (size: 4.07 GiB)

This model doesn't fit in Host A's VRAM but fits in Host B's.

CUDA backend

We can offload 23 layers to GPU:

$ bin/main -m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 -ngl 23 -fa

Result: 18 t/s

RPC with remote server running CUDA backend

$ bin/main -m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.2:50052 -ngl 99 -fa

Result: 25 t/s

RPC with local and remote server running CUDA backend

$ bin/main -m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.2:50052,192.168.88.100:50052 -ngl 99 -fa

Result: 22 t/s

rgerganov avatar May 10 '24 11:05 rgerganov

With more than one RPC server we could improve performance with a better implementation of cpy_tensor. Currently there is no way to copy tensors directly between RPC servers: each tensor is downloaded to the host where main is running and then uploaded to the destination server. To fix this, RPC servers would need to talk to each other.
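The current copy path can be sketched as follows; `download`/`upload` here are hypothetical stand-ins for the backend's get-tensor / set-tensor operations, and `FakeServer` stands in for a remote rpc-server:

```python
# Sketch of the copy path described above: with no direct server-to-server
# transfer, every cross-server copy is a download to the host running `main`
# followed by an upload, so the payload crosses the network twice.
def copy_via_host(src, dst, tensor_id):
    data = src.download(tensor_id)   # server A -> host
    dst.upload(tensor_id, data)      # host -> server B
    return len(data)                 # bytes that traversed each link

class FakeServer:
    # hypothetical in-process stand-in for a remote rpc-server
    def __init__(self):
        self.tensors = {}
    def download(self, tid):
        return self.tensors[tid]
    def upload(self, tid, data):
        self.tensors[tid] = data

a, b = FakeServer(), FakeServer()
a.tensors["blk.16.attn_k.weight"] = b"\x00" * 4096
moved = copy_via_host(a, b, "blk.16.attn_k.weight")
print(moved)  # 4096
```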

rgerganov avatar May 10 '24 11:05 rgerganov

would this also indirectly allow multiple backends on the same machine by running multiple instances of the rpc-server on different ports? e.g. cuda and rocm for a single machine with nvidia and amd gpus?

Weroxig avatar May 10 '24 15:05 Weroxig

would this also indirectly allow multiple backends on the same machine by running multiple instances of the rpc-server on different ports? e.g. cuda and rocm for a single machine with nvidia and amd gpus?

Yes. I should clarify this in the README.

rgerganov avatar May 10 '24 15:05 rgerganov

In the future, it would be good to be able to build with a local backend in addition to the RPC backend, so that an RPC server is not necessary to use the local GPU.

Agree, but we should select the backend at runtime instead of at compile time for this to work. I will be working on this next.

rgerganov avatar May 13 '24 09:05 rgerganov

in theory, Could this PR allow GPU inference across different API?

Yes, you can use different backend implementations, running on different machines. Build an rpc-server for each configuration and run them in the same local network. The main example should be configured with the IP:port of each rpc-server and it should be able to offload model layers to them.

Really like this! This is basically the same concept LocalAI is built on, as we have a gRPC server on top of llama.cpp and other backends as well. Really happy to see this in llama.cpp!

mudler avatar May 13 '24 15:05 mudler

dont stop this candy

elix1er avatar May 20 '24 11:05 elix1er