Multi-instance TRT model slower than single-instance one. (GPU)
Description
I noticed that a model with several instances is slower than the same model with a single instance. I believe this should not be the case, but the throughput and latency numbers say otherwise.
Triton Information
Server: nvcr.io/nvidia/tritonserver:24.03-py3
perf_analyzer client: nvcr.io/nvidia/tritonserver:24.01-py3-sdk
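For reproducibility, the two images are typically launched along these lines; the model-repository path and port mappings below are assumptions, not taken from the report:

# Server container (model repository path is hypothetical)
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.03-py3 \
  tritonserver --model-repository=/models

# Client container with perf_analyzer
docker run -it --rm --net host nvcr.io/nvidia/tritonserver:24.01-py3-sdk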
To Reproduce
1) config.pbtxt (model_trt_1)
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 1024, 1024]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [1, 1024, 1024]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
2) config.pbtxt (model_trt_4)
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 1024, 1024]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [1, 1024, 1024]
  }
]
instance_group {
  count: 4
  kind: KIND_GPU
}
root@s11744725:/workspace# perf_analyzer -m model_trt_1 -u localhost:8000 -i http --shape input:1,3,1024,1024 --concurrency-range 1:4 --async --percentile 95
*** Measurement Settings ***
Batch size: 1
Service Kind: Triton
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 0 msec
Concurrency limit: 4 concurrent requests
Using asynchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 709
Throughput: 39.3808 infer/sec
p50 latency: 25153 usec
p90 latency: 27377 usec
p95 latency: 27527 usec
p99 latency: 32657 usec
Avg HTTP time: 25281 usec (send/recv 8543 usec + response wait 16738 usec)
Server:
Inference count: 709
Execution count: 709
Successful request count: 709
Avg request latency: 13252 usec (overhead 93 usec + queue 122 usec + compute input 2705 usec + compute infer 9062 usec + compute output 1269 usec)
Request concurrency: 2
Client:
Request count: 1715
Throughput: 95.2565 infer/sec
p50 latency: 20681 usec
p90 latency: 21114 usec
p95 latency: 22082 usec
p99 latency: 26529 usec
Avg HTTP time: 20884 usec (send/recv 6295 usec + response wait 14589 usec)
Server:
Inference count: 1715
Execution count: 1715
Successful request count: 1715
Avg request latency: 12796 usec (overhead 78 usec + queue 108 usec + compute input 1294 usec + compute infer 9071 usec + compute output 2243 usec)
Request concurrency: 3
Client:
Request count: 1837
Throughput: 102.038 infer/sec
p50 latency: 29184 usec
p90 latency: 30552 usec
p95 latency: 31036 usec
p99 latency: 37301 usec
Avg HTTP time: 29271 usec (send/recv 8361 usec + response wait 20910 usec)
Server:
Inference count: 1837
Execution count: 1837
Successful request count: 1837
Avg request latency: 15120 usec (overhead 59 usec + queue 243 usec + compute input 525 usec + compute infer 9219 usec + compute output 5073 usec)
Request concurrency: 4
Client:
Request count: 1914
Throughput: 106.312 infer/sec
p50 latency: 37489 usec
p90 latency: 37919 usec
p95 latency: 38124 usec
p99 latency: 39696 usec
Avg HTTP time: 37470 usec (send/recv 8654 usec + response wait 28816 usec)
Server:
Inference count: 1915
Execution count: 1915
Successful request count: 1915
Avg request latency: 26854 usec (overhead 47 usec + queue 8231 usec + compute input 6 usec + compute infer 9388 usec + compute output 9180 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 39.3808 infer/sec, latency 27527 usec
Concurrency: 2, throughput: 95.2565 infer/sec, latency 22082 usec
Concurrency: 3, throughput: 102.038 infer/sec, latency 31036 usec
Concurrency: 4, throughput: 106.312 infer/sec, latency 38124 usec
root@s11744725:/workspace# perf_analyzer -m model_trt_4 -u localhost:8000 -i http --shape input:1,3,1024,1024 --concurrency-range 1:4 --async --percentile 95
*** Measurement Settings ***
Batch size: 1
Service Kind: Triton
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 0 msec
Concurrency limit: 4 concurrent requests
Using asynchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 733
Throughput: 40.7144 infer/sec
p50 latency: 24062 usec
p90 latency: 26462 usec
p95 latency: 27044 usec
p99 latency: 30431 usec
Avg HTTP time: 24419 usec (send/recv 8199 usec + response wait 16220 usec)
Server:
Inference count: 733
Execution count: 733
Successful request count: 733
Avg request latency: 12701 usec (overhead 89 usec + queue 91 usec + compute input 2145 usec + compute infer 9139 usec + compute output 1236 usec)
Request concurrency: 2
Client:
Request count: 1004
Throughput: 55.7662 infer/sec
p50 latency: 35972 usec
p90 latency: 36536 usec
p95 latency: 36783 usec
p99 latency: 46601 usec
Avg HTTP time: 35683 usec (send/recv 9262 usec + response wait 26421 usec)
Server:
Inference count: 1004
Execution count: 1004
Successful request count: 1004
Avg request latency: 17346 usec (overhead 74 usec + queue 929 usec + compute input 2487 usec + compute infer 12806 usec + compute output 1049 usec)
Request concurrency: 3
Client:
Request count: 1084
Throughput: 60.2104 infer/sec
p50 latency: 49659 usec
p90 latency: 51530 usec
p95 latency: 52160 usec
p99 latency: 61044 usec
Avg HTTP time: 49636 usec (send/recv 12495 usec + response wait 37141 usec)
Server:
Inference count: 1084
Execution count: 1084
Successful request count: 1084
Avg request latency: 22764 usec (overhead 70 usec + queue 1443 usec + compute input 2715 usec + compute infer 17498 usec + compute output 1037 usec)
Request concurrency: 4
Client:
Request count: 1225
Throughput: 68.041 infer/sec
p50 latency: 63755 usec
p90 latency: 69768 usec
p95 latency: 71307 usec
p99 latency: 73666 usec
Avg HTTP time: 58598 usec (send/recv 15400 usec + response wait 43198 usec)
Server:
Inference count: 1225
Execution count: 1225
Successful request count: 1225
Avg request latency: 24254 usec (overhead 67 usec + queue 1256 usec + compute input 2636 usec + compute infer 19198 usec + compute output 1096 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 40.7144 infer/sec, latency 27044 usec
Concurrency: 2, throughput: 55.7662 infer/sec, latency 36783 usec
Concurrency: 3, throughput: 60.2104 infer/sec, latency 52160 usec
Concurrency: 4, throughput: 68.041 infer/sec, latency 71307 usec
Expected behavior
More model instances should provide more throughput.
Hi @decadance-dance, I came across this issue during my tests too, when I was loading a lot of models into the same Triton pod, though in my case it was with the TensorFlow backend on CPU.
- Can you share details on the hardware used (number of CPUs, GPU, memory, etc.)?
- Can you try a larger max_batch_size with dynamic_batching enabled, as an ablation study? (A sketch follows this list.)
- A recommendation from my experience: I've seen at least 15% better throughput/latency when using gRPC rather than HTTP.
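A minimal config.pbtxt sketch of what that ablation could look like; the batch sizes, queue delay, and instance count below are placeholder values, and the TensorRT engine would have to be rebuilt to support the larger batch:

platform: "tensorrt_plan"
max_batch_size: 8                    # requires an engine built for max batch >= 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 1024, 1024]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [1, 1024, 1024]
  }
]
dynamic_batching {
  preferred_batch_size: [4, 8]       # placeholder values
  max_queue_delay_microseconds: 100  # placeholder value
}
instance_group {
  count: 2
  kind: KIND_GPU
}

For the gRPC comparison, perf_analyzer only needs the protocol flag and Triton's gRPC port (the default 8001 is assumed here):

perf_analyzer -m model_trt_4 -u localhost:8001 -i grpc --shape input:1,3,1024,1024 --concurrency-range 1:4 --async --percentile 95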
Hi @AshwinAmbal
My hardware specs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A30 Off | 00000000:CA:00.0 Off | 0 |
| N/A 34C P0 33W / 165W | 5984MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
CPU: Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, 64 cores
RAM: 128GB
When I set a larger max_batch_size I get the error: Internal: autofill failed for model 'model_trt_4': configuration specified max-batch 4 but TensorRT engine only supports max-batch 1.
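That error indicates the serialized engine itself was built with a maximum batch size of 1, so the plan file has to be regenerated before a larger max_batch_size can take effect. A rough trtexec sketch, assuming the engine is built from an ONNX export with a dynamic batch dimension; the file names and profile shapes are hypothetical:

# Rebuild the plan with a dynamic batch dimension (hypothetical file names).
# The optimization profile allows batch sizes from 1 up to 8.
trtexec --onnx=model.onnx \
        --saveEngine=model.plan \
        --minShapes=input:1x3x1024x1024 \
        --optShapes=input:4x3x1024x1024 \
        --maxShapes=input:8x3x1024x1024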
Also, I didn't find performance improvements when using gRPC.
P.S. I see similar behavior not only with the TensorRT backend but also with others, for example the ONNX backend on CPU.