Fluctuating results when running perf_analyzer for a pressure test
Description I used the latest image version, 24.06, because the corresponding latest version of TensorRT supports BF16, and I deployed the model with the TensorRT backend. When I used perf_analyzer to pressure-test the model service, I got fluctuating results.
Triton Information 2.47.0
Are you using the Triton container or did you build it yourself?
image version 24.06
To Reproduce Run perf_analyzer as follows:
perf_analyzer --concurrency-range 1:8 -p 5000 --latency-threshold 300 -f perf.csv -m my_model_name -i grpc --request-distribution poisson -b 256 -u localhost:6601 --percentile 99 --input-data=random
My pressure test results:
*** Measurement Settings ***
Batch size: 256
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 300 msec
Concurrency limit: 8 concurrent requests
Using synchronous calls for inference
Stabilizing using p99 latency
Request concurrency: 1
Client:
Request count: 1299
Throughput: 18473.3 infer/sec
p50 latency: 13806 usec
p90 latency: 13945 usec
p95 latency: 14248 usec
p99 latency: 14610 usec
Avg gRPC time: 13836 usec ((un)marshal request/response 1300 usec + response wait 12536 usec)
Server:
Inference count: 332544
Execution count: 1299
Successful request count: 1299
Avg request latency: 11282 usec (overhead 34 usec + queue 28 usec + compute input 2846 usec + compute infer 8318 usec + compute output 55 usec)
Request concurrency: 2
Client:
Request count: 1611
Throughput: 22910.6 infer/sec
p50 latency: 22316 usec
p90 latency: 22440 usec
p95 latency: 22488 usec
p99 latency: 22598 usec
Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
Server:
Inference count: 412416
Execution count: 1611
Successful request count: 1611
Avg request latency: 19400 usec (overhead 37 usec + queue 8099 usec + compute input 2840 usec + compute infer 8327 usec + compute output 96 usec)
Request concurrency: 3
Client:
Request count: 1091
Throughput: 15515.2 infer/sec
p50 latency: 49428 usec
p90 latency: 49735 usec
p95 latency: 50021 usec
p99 latency: 54494 usec
Avg gRPC time: 49517 usec ((un)marshal request/response 1346 usec + response wait 48171 usec)
Server:
Inference count: 279296
Execution count: 727
Successful request count: 1091
Avg request latency: 46345 usec (overhead 119 usec + queue 20338 usec + compute input 3312 usec + compute infer 22479 usec + compute output 96 usec)
Request concurrency: 4
Client:
Request count: 2135
Throughput: 30362.8 infer/sec
p50 latency: 33672 usec
p90 latency: 33822 usec
p95 latency: 33867 usec
p99 latency: 33992 usec
Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
Server:
Inference count: 546560
Execution count: 1068
Successful request count: 2135
Avg request latency: 30395 usec (overhead 153 usec + queue 13290 usec + compute input 3549 usec + compute infer 13301 usec + compute output 101 usec)
Request concurrency: 5
Client:
Request count: 2136
Throughput: 30377 infer/sec
p50 latency: 36885 usec
p90 latency: 50683 usec
p95 latency: 50778 usec
p99 latency: 51032 usec
Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
Server:
Inference count: 546816
Execution count: 1068
Successful request count: 2136
Avg request latency: 38631 usec (overhead 154 usec + queue 21520 usec + compute input 3572 usec + compute infer 13285 usec + compute output 99 usec)
Request concurrency: 6
Client:
Request count: 2136
Throughput: 30377 infer/sec
p50 latency: 50544 usec
p90 latency: 50729 usec
p95 latency: 50806 usec
p99 latency: 50961 usec
Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
Server:
Inference count: 546816
Execution count: 1068
Successful request count: 2136
Avg request latency: 47023 usec (overhead 171 usec + queue 29900 usec + compute input 3580 usec + compute infer 13271 usec + compute output 100 usec)
Request concurrency: 7
Client:
Request count: 1497
Throughput: 21289.7 infer/sec
p50 latency: 84223 usec
p90 latency: 84519 usec
p95 latency: 84635 usec
p99 latency: 87573 usec
Avg gRPC time: 0 usec ((un)marshal request/response 0 usec + response wait 0 usec)
Server:
Inference count: 383232
Execution count: 855
Successful request count: 1497
Avg request latency: 80567 usec (overhead 157 usec + queue 59422 usec + compute input 3555 usec + compute infer 17334 usec + compute output 98 usec)
Request concurrency: 8
Client:
Request count: 2130
Throughput: 30291.3 infer/sec
p50 latency: 67560 usec
p90 latency: 67819 usec
p95 latency: 67898 usec
p99 latency: 68080 usec
Avg gRPC time: 67562 usec ((un)marshal request/response 1426 usec + response wait 66136 usec)
Server:
Inference count: 545280
Execution count: 1065
Successful request count: 2130
Avg request latency: 64059 usec (overhead 178 usec + queue 46884 usec + compute input 3670 usec + compute infer 13226 usec + compute output 101 usec)
Concurrency: 1, throughput: 18473.3 infer/sec, latency 14610 usec
Concurrency: 2, throughput: 22910.6 infer/sec, latency 22598 usec
Concurrency: 3, throughput: 15515.2 infer/sec, latency 54494 usec
Concurrency: 4, throughput: 30362.8 infer/sec, latency 33992 usec
Concurrency: 5, throughput: 30377 infer/sec, latency 51032 usec
Concurrency: 6, throughput: 30377 infer/sec, latency 50961 usec
Concurrency: 7, throughput: 21289.7 infer/sec, latency 87573 usec
Concurrency: 8, throughput: 30291.3 infer/sec, latency 68080 usec
You can see that the throughput drops significantly at concurrency 3 and 7, which seems very strange. Does anyone know a possible cause?
Some Settings in config.pbtxt:
max_batch_size: 512
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 256
  max_queue_delay_microseconds: 100
}
optimization {
  cuda {
    busy_wait_events: true
    output_copy_stream: true
  }
}
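For anyone reproducing this: Triton's per-model statistics extension can show how the dynamic batcher is actually grouping requests at each concurrency level (inference vs. execution counts, per-batch-size execution counts, cumulative queue time). A minimal sketch, assuming the HTTP endpoint is enabled on its default port 8000 (only the gRPC port 6601 appears above, so the port here is an assumption):
# Cumulative per-model statistics: inference/execution counts, queue and
# compute durations, and per-batch-size execution counts (batch_stats).
# Port 8000 is Triton's default HTTP port; adjust to your deployment.
curl -s localhost:8000/v2/models/my_model_name/stats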
Expected behavior Is there a statistical problem with how the timings are measured, or is there a configuration problem? I hope to see a more stable result.
CC @matthewkotila @nv-hwoo if you have any thoughts on the variance or improvements to the provided PA arguments
I don't have any concrete ideas on why this would be happening.
@LinGeLin have you tried re-running the entire experiment multiple times to confirm that it consistently shows degraded performance for concurrencies 3 and 7? Perhaps you'll want to decrease the stability percentage (-s)? And/or increase the measurement window (--measurement-interval)?
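For illustration, here is a variant of the original command along those lines: it tightens the stability threshold with -s and widens the measurement window via --measurement-interval (the same window that -p 5000 set above). The values of 5% and 20000 ms are only example choices and would still need tuning for this workload:
# Example only: stricter stability criterion (-s 5) and a longer 20 s
# measurement window to average out run-to-run variance.
perf_analyzer --concurrency-range 1:8 --measurement-interval 20000 --latency-threshold 300 -s 5 -f perf.csv -m my_model_name -i grpc --request-distribution poisson -b 256 -u localhost:6601 --percentile 99 --input-data=random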