
Multi-instance TRT model slower than single-instance one. (GPU)

decadance-dance opened this issue · 3 comments

Description
I noticed that a model with several instances is slower than one with a single instance. I believe this should not be the case, but the throughput and latency numbers say otherwise.

Triton Information
server: nvcr.io/nvidia/tritonserver:24.03-py3
perf_analyzer (SDK client): nvcr.io/nvidia/tritonserver:24.01-py3-sdk
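
The exact container launch commands are not shown in the issue; the following is only a minimal sketch of how such a setup is typically started. The port mappings and the local model repository path are assumptions, not taken from the report.

# Server container: serve the model repository over HTTP (8000), gRPC (8001) and metrics (8002).
docker run --rm --gpus=1 -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $PWD/model_repository:/models \
    nvcr.io/nvidia/tritonserver:24.03-py3 \
    tritonserver --model-repository=/models

# Client/SDK container: provides perf_analyzer; host networking so it can reach localhost:8000.
docker run --rm -it --net=host nvcr.io/nvidia/tritonserver:24.01-py3-sdk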

To Reproduce

TRT model

1) config.pbtxt (model_trt_1)
platform: "tensorrt_plan"
max_batch_size: 1
input[
    {
        name: "input"
        data_type:  TYPE_FP32
        dims: [3, 1024, 1024]
    }
]
output:[
    {
        name: "logits"
        data_type:  TYPE_FP32
        dims: [1, 1024, 1024]
    }
]
instance_group {
  count: 1
  kind: KIND_GPU
}

2) config.pbtxt (model_trt_4)
platform: "tensorrt_plan"
max_batch_size: 1
input[
    {
        name: "input"
        data_type:  TYPE_FP32
        dims: [3, 1024, 1024]
    }
]
output:[
    {
        name: "logits"
        data_type:  TYPE_FP32
        dims: [1, 1024, 1024]
    }
]
instance_group {
  count: 4
  kind: KIND_GPU
}
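
One quick sanity check (not part of the original runs) is to ask the running server for the configuration it actually loaded, so the instance count can be confirmed; a minimal sketch assuming the default HTTP port 8000:

# The instance_group section in the response should report count: 4 for model_trt_4.
curl -s localhost:8000/v2/models/model_trt_4/config | python3 -m json.tool
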
root@s11744725:/workspace# perf_analyzer -m model_trt_1 -u localhost:8000 -i http --shape input:1,3,1024,1024 --concurrency-range 1:4 --async --percentile 95
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 4 concurrent requests
  Using asynchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 1
  Client: 
    Request count: 709
    Throughput: 39.3808 infer/sec
    p50 latency: 25153 usec
    p90 latency: 27377 usec
    p95 latency: 27527 usec
    p99 latency: 32657 usec
    Avg HTTP time: 25281 usec (send/recv 8543 usec + response wait 16738 usec)
  Server: 
    Inference count: 709
    Execution count: 709
    Successful request count: 709
    Avg request latency: 13252 usec (overhead 93 usec + queue 122 usec + compute input 2705 usec + compute infer 9062 usec + compute output 1269 usec)

Request concurrency: 2
  Client: 
    Request count: 1715
    Throughput: 95.2565 infer/sec
    p50 latency: 20681 usec
    p90 latency: 21114 usec
    p95 latency: 22082 usec
    p99 latency: 26529 usec
    Avg HTTP time: 20884 usec (send/recv 6295 usec + response wait 14589 usec)
  Server: 
    Inference count: 1715
    Execution count: 1715
    Successful request count: 1715
    Avg request latency: 12796 usec (overhead 78 usec + queue 108 usec + compute input 1294 usec + compute infer 9071 usec + compute output 2243 usec)

Request concurrency: 3
  Client: 
    Request count: 1837
    Throughput: 102.038 infer/sec
    p50 latency: 29184 usec
    p90 latency: 30552 usec
    p95 latency: 31036 usec
    p99 latency: 37301 usec
    Avg HTTP time: 29271 usec (send/recv 8361 usec + response wait 20910 usec)
  Server: 
    Inference count: 1837
    Execution count: 1837
    Successful request count: 1837
    Avg request latency: 15120 usec (overhead 59 usec + queue 243 usec + compute input 525 usec + compute infer 9219 usec + compute output 5073 usec)

Request concurrency: 4
  Client: 
    Request count: 1914
    Throughput: 106.312 infer/sec
    p50 latency: 37489 usec
    p90 latency: 37919 usec
    p95 latency: 38124 usec
    p99 latency: 39696 usec
    Avg HTTP time: 37470 usec (send/recv 8654 usec + response wait 28816 usec)
  Server: 
    Inference count: 1915
    Execution count: 1915
    Successful request count: 1915
    Avg request latency: 26854 usec (overhead 47 usec + queue 8231 usec + compute input 6 usec + compute infer 9388 usec + compute output 9180 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 39.3808 infer/sec, latency 27527 usec
Concurrency: 2, throughput: 95.2565 infer/sec, latency 22082 usec
Concurrency: 3, throughput: 102.038 infer/sec, latency 31036 usec
Concurrency: 4, throughput: 106.312 infer/sec, latency 38124 usec
root@s11744725:/workspace# perf_analyzer -m model_trt_4 -u localhost:8000 -i http --shape input:1,3,1024,1024 --concurrency-range 1:4 --async --percentile 95
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 4 concurrent requests
  Using asynchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 1
  Client: 
    Request count: 733
    Throughput: 40.7144 infer/sec
    p50 latency: 24062 usec
    p90 latency: 26462 usec
    p95 latency: 27044 usec
    p99 latency: 30431 usec
    Avg HTTP time: 24419 usec (send/recv 8199 usec + response wait 16220 usec)
  Server: 
    Inference count: 733
    Execution count: 733
    Successful request count: 733
    Avg request latency: 12701 usec (overhead 89 usec + queue 91 usec + compute input 2145 usec + compute infer 9139 usec + compute output 1236 usec)

Request concurrency: 2
  Client: 
    Request count: 1004
    Throughput: 55.7662 infer/sec
    p50 latency: 35972 usec
    p90 latency: 36536 usec
    p95 latency: 36783 usec
    p99 latency: 46601 usec
    Avg HTTP time: 35683 usec (send/recv 9262 usec + response wait 26421 usec)
  Server: 
    Inference count: 1004
    Execution count: 1004
    Successful request count: 1004
    Avg request latency: 17346 usec (overhead 74 usec + queue 929 usec + compute input 2487 usec + compute infer 12806 usec + compute output 1049 usec)

Request concurrency: 3
  Client: 
    Request count: 1084
    Throughput: 60.2104 infer/sec
    p50 latency: 49659 usec
    p90 latency: 51530 usec
    p95 latency: 52160 usec
    p99 latency: 61044 usec
    Avg HTTP time: 49636 usec (send/recv 12495 usec + response wait 37141 usec)
  Server: 
    Inference count: 1084
    Execution count: 1084
    Successful request count: 1084
    Avg request latency: 22764 usec (overhead 70 usec + queue 1443 usec + compute input 2715 usec + compute infer 17498 usec + compute output 1037 usec)

Request concurrency: 4
  Client: 
    Request count: 1225
    Throughput: 68.041 infer/sec
    p50 latency: 63755 usec
    p90 latency: 69768 usec
    p95 latency: 71307 usec
    p99 latency: 73666 usec
    Avg HTTP time: 58598 usec (send/recv 15400 usec + response wait 43198 usec)
  Server: 
    Inference count: 1225
    Execution count: 1225
    Successful request count: 1225
    Avg request latency: 24254 usec (overhead 67 usec + queue 1256 usec + compute input 2636 usec + compute infer 19198 usec + compute output 1096 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 40.7144 infer/sec, latency 27044 usec
Concurrency: 2, throughput: 55.7662 infer/sec, latency 36783 usec
Concurrency: 3, throughput: 60.2104 infer/sec, latency 52160 usec
Concurrency: 4, throughput: 68.041 infer/sec, latency 71307 usec

Expected behavior
More model instances should provide higher throughput, not lower.

decadance-dance avatar Apr 05 '24 16:04 decadance-dance

Hi @decadance-dance, I ran into this issue during my tests too when loading a lot of models into the same Triton pod, though in my case it was with TensorFlow and on CPU.

  1. Can you share details of the hardware used: number of CPUs, GPUs, memory, etc.?
  2. Can you try a larger max_batch_size with dynamic_batching enabled, as an ablation study? (A rough config sketch follows this list.)
  3. This is a recommendation from my own experience: I've seen at least 15% better throughput and latency when using gRPC rather than HTTP.
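
For point 2, a rough sketch of what such a configuration could look like. This is hedged: it assumes the TensorRT engine has been rebuilt to accept batch sizes larger than 1, and the batch sizes, instance count, and queue delay below are placeholders rather than tested values.

platform: "tensorrt_plan"
max_batch_size: 8                       # requires an engine built for batches up to 8
input [
    {
        name: "input"
        data_type: TYPE_FP32
        dims: [3, 1024, 1024]
    }
]
output [
    {
        name: "logits"
        data_type: TYPE_FP32
        dims: [1, 1024, 1024]
    }
]
instance_group {
  count: 2
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: [4, 8]          # coalesce concurrent requests into larger batches
  max_queue_delay_microseconds: 100     # small delay to give the batcher time to form a batch
}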

AshwinAmbal avatar Apr 05 '24 20:04 AshwinAmbal

Hi @AshwinAmbal

My hardware specs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   34C    P0    33W / 165W |   5984MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

CPU: Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, 64 cores
RAM: 128GB

When I set a larger max_batch_size I get the error: Internal: autofill failed for model 'model_trt_4': configuration specified max-batch 4 but TensorRT engine only supports max-batch 1.
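
That error suggests the plan file itself was built with a maximum batch size of 1, so a larger max_batch_size would require rebuilding the engine. A hedged sketch, assuming the model originates from an ONNX file with a dynamic batch dimension (the file names and shape ranges below are placeholders, not taken from this issue):

# Rebuild the TensorRT engine with a dynamic batch dimension (1..8) so Triton can use max_batch_size > 1.
trtexec --onnx=model.onnx \
        --minShapes=input:1x3x1024x1024 \
        --optShapes=input:4x3x1024x1024 \
        --maxShapes=input:8x3x1024x1024 \
        --saveEngine=model.plan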

Also, I didn't find performance improvements when using gRPC.

P.S. I found similar behavior not only with the TensorRT backend but also with others, for example the ONNX backend on CPU.

decadance-dance avatar Apr 08 '24 12:04 decadance-dance