Odd throughput latency curve when using perf_analyzer
Description I was running some benchmark tests on Triton to see how the throughput behaved under different request rates, and after graphing the results, the throughput vs. average latency curves looked odd. Instead of the characteristic shape where latency remains relatively constant until hitting a knee point where it spikes, the latency started high and decreased until it spiked. This happened over multiple trials and across multiple instance counts of a model on a single GPU.
I considered that a warmup period could explain the initially high latencies, but further testing with perf_analyzer suggested the warmup only lasted a few requests before the average latency settled. The only other program running in the background was dcgmi, used to monitor the GPU's power usage.
In the screenshots, gpu_x refers to the number of instances on a single GPU. Only one GPU was used.
Triton Information I am using the r23.09 Triton container and r23.09 Triton Client container to run perf_analyzer.
To Reproduce
perf_analyzer command:
perf_analyzer -m densenet_onnx --percentile=99 -p 15000 --request-rate-range=<request rate>
The request rate was varied in the for loop of a bash script (shared later in the thread).
The model was the densenet_onnx model from the setup examples. The only configuration change was the number of model instances, which varied between trials (1, 2, or 3 instances). The GPU was an RTX 6000, and the inputs to the model were randomly generated by perf_analyzer.
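For context, the instance count per GPU in Triton is controlled by the instance_group field in the model's config.pbtxt. The exact file used in these trials isn't shown in the thread, so this is only an illustrative sketch of what the two-instance case could have looked like:

instance_group [
  {
    count: 2        # varied across trials: 1, 2, or 3
    kind: KIND_GPU
    gpus: [ 0 ]     # single RTX 6000
  }
]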
Expected behavior Latency remains low and constant until the system hits the point where it is overloaded, at which point latency spikes while throughput remains constant.
Can you try upgrading your container to 23.12? cc: @tgerdesnv @debermudez
Re-running the same tests with r23.12 (with a couple of additional request rates for more data points) yielded the same shape, attached below. It's hard to see in the graph, but there is a point at which throughput and latency become nearly constant (very little variation, though not exactly the same) as the request rate keeps increasing.
Thanks for sharing @rhamilt. We will investigate. @matthewkotila
@tanmayv25 in case you have any quick ideas
@rhamilt Can you please share the script you used to run the perf analyzer to generate results and these graphs?
@ganeshku1
#!/bin/bash
# $1 = number of model instances configured in Triton; used to name the output files.

mkdir -p dcgm_output
mkdir -p perf_output
rm -f dcgm_output/$1.txt
rm -f perf_output/$1.txt
rm -f perf_output/temp_$1.txt

# Request rates (inferences/sec) to sweep over.
rates=(20 40 60 80 100 120 140 160 180 200 220 240)
for i in "${rates[@]}"; do
    # Monitor GPU power/utilization in the background while perf_analyzer runs.
    dcgmi dmon -d 500 -e 150,155,156,203 >> dcgm_output/$1.txt &
    perf_analyzer -m densenet_onnx --percentile=99 -p 15000 --request-rate-range=$i >> perf_output/temp_$1.txt
    # Stop the monitor and separate runs with blank lines.
    pkill -f dcgmi
    echo >> dcgm_output/$1.txt
    echo >> dcgm_output/$1.txt
done

# Keep only the per-rate summary lines (request rate, throughput, latency).
grep 'Request Rate:.*,' perf_output/temp_$1.txt > perf_output/$1.txt
The input to the script ($1) is the number of model instances that were running in Triton. The graphing script is a separate Python file using matplotlib and is not involved in data collection.
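The graphing script itself isn't shared in the thread; the following is a minimal sketch of how the grep-filtered perf_output files could be turned into throughput vs. latency plots. The file names, the per-instance loop, and the assumed summary-line format ("Request Rate: X, throughput: Y infer/sec, latency Z usec") are assumptions, not the actual script used.

# plot_curves.py -- hypothetical sketch, not the reporter's script.
# Parses perf_output/<instances>.txt and plots throughput vs. p99 latency
# for each instance count.
import re
import matplotlib.pyplot as plt

# Assumed perf_analyzer summary-line format, e.g.:
# "Request Rate: 60, throughput: 59.8 infer/sec, latency 123456 usec"
LINE_RE = re.compile(
    r"Request Rate:\s*([\d.]+).*throughput:\s*([\d.]+)\s*infer/sec.*latency\s*([\d.]+)\s*usec"
)

for instances in (1, 2, 3):  # matches the $1 values passed to the bash script
    throughputs, latencies = [], []
    with open(f"perf_output/{instances}.txt") as f:
        for line in f:
            m = LINE_RE.search(line)
            if m:
                throughputs.append(float(m.group(2)))
                latencies.append(float(m.group(3)) / 1000.0)  # usec -> ms
    plt.plot(throughputs, latencies, marker="o", label=f"gpu_{instances}")

plt.xlabel("Throughput (infer/sec)")
plt.ylabel("p99 latency (ms)")
plt.title("Throughput vs. p99 latency (densenet_onnx)")
plt.legend()
plt.savefig("throughput_vs_latency.png")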
Any updates? @ganeshku1