Odd throughput latency curve when using perf_analyzer
Description I was running some benchmark tests on Triton to see how the throughput behaved under different request rates, and after graphing the results, the throughput vs. average latency curves looked odd. Instead of the characteristic shape where latency remains relatively constant until hitting a knee point where it spikes, the latency started high and decreased until it spiked. This happened over multiple trials and across multiple instance counts of a model on a single GPU.
I considered that a warmup period could explain the initially high latencies, but further testing with perf_analyzer suggested the warmup only lasted a few requests before the average latency settled. The only other program running in the background was dcgmi, used to monitor the GPU's power usage.
In the screenshots, gpu_x refers to the number of instances on a single GPU. Only one GPU was used.
Triton Information I am using the r23.09 Triton container and r23.09 Triton Client container to run perf_analyzer.
To Reproduce
perf_analyzer command:
perf_analyzer -m densenet_onnx --percentile=99 -p 15000 --request-rate-range=<request rate>
The request rate was varied in the for loop of a bash script (shared later in the thread).
The model was the densenet_onnx model from the setup examples. The only configuration change was the number of model instances, which varied between trials (1, 2, or 3 instances). The GPU was an RTX 6000, and the inputs to the model were randomly generated by perf_analyzer.
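For context, the instance count per GPU in Triton is controlled by the instance_group field in the model's config.pbtxt. The exact file used in these trials isn't shown in the thread, so this is only an illustrative sketch of what the two-instance case could have looked like:

instance_group [
  {
    count: 2        # varied across trials: 1, 2, or 3
    kind: KIND_GPU
    gpus: [ 0 ]     # single RTX 6000
  }
]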
Expected behavior Latency remains low and constant until the system hits the point where it is overloaded, at which point latency spikes while throughput remains constant.
Can you try upgrading your container to 23.12? cc: @tgerdesnv @debermudez
Re-running the same tests with r23.12 (with a couple of additional request rates for more data points) yielded the same shape, attached below. It's hard to see in the graph, but there is a point at which throughput and latency become nearly constant (very little variation, though not exactly the same) as the request rate keeps increasing.
Thanks for sharing @rhamilt. We will investigate. @matthewkotila
@tanmayv25 in case you have any quick ideas
@rhamilt Can you please share the script you used to run the perf analyzer to generate results and these graphs?
@ganeshku1
#!/bin/bash
# $1 = number of model instances configured in Triton; used to name the output files.

mkdir -p dcgm_output
mkdir -p perf_output
rm -f dcgm_output/$1.txt
rm -f perf_output/$1.txt
rm -f perf_output/temp_$1.txt

# Request rates (inferences/sec) to sweep over.
rates=(20 40 60 80 100 120 140 160 180 200 220 240)
for i in "${rates[@]}"; do
    # Monitor GPU power/utilization in the background while perf_analyzer runs.
    dcgmi dmon -d 500 -e 150,155,156,203 >> dcgm_output/$1.txt &
    perf_analyzer -m densenet_onnx --percentile=99 -p 15000 --request-rate-range=$i >> perf_output/temp_$1.txt
    # Stop the monitor and separate runs with blank lines.
    pkill -f dcgmi
    echo >> dcgm_output/$1.txt
    echo >> dcgm_output/$1.txt
done

# Keep only the per-rate summary lines (request rate, throughput, latency).
grep 'Request Rate:.*,' perf_output/temp_$1.txt > perf_output/$1.txt
The input to the script ($1) is the number of model instances that were running in Triton. The graphing script is a separate Python file using matplotlib and is not involved in data collection.
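The graphing script itself isn't shared in the thread; the following is a minimal sketch of how the grep-filtered perf_output files could be turned into throughput vs. latency plots. The file names, the per-instance loop, and the assumed summary-line format ("Request Rate: X, throughput: Y infer/sec, latency Z usec") are assumptions, not the actual script used.

# plot_curves.py -- hypothetical sketch, not the reporter's script.
# Parses perf_output/<instances>.txt and plots throughput vs. p99 latency
# for each instance count.
import re
import matplotlib.pyplot as plt

# Assumed perf_analyzer summary-line format, e.g.:
# "Request Rate: 60, throughput: 59.8 infer/sec, latency 123456 usec"
LINE_RE = re.compile(
    r"Request Rate:\s*([\d.]+).*throughput:\s*([\d.]+)\s*infer/sec.*latency\s*([\d.]+)\s*usec"
)

for instances in (1, 2, 3):  # matches the $1 values passed to the bash script
    throughputs, latencies = [], []
    with open(f"perf_output/{instances}.txt") as f:
        for line in f:
            m = LINE_RE.search(line)
            if m:
                throughputs.append(float(m.group(2)))
                latencies.append(float(m.group(3)) / 1000.0)  # usec -> ms
    plt.plot(throughputs, latencies, marker="o", label=f"gpu_{instances}")

plt.xlabel("Throughput (infer/sec)")
plt.ylabel("p99 latency (ms)")
plt.title("Throughput vs. p99 latency (densenet_onnx)")
plt.legend()
plt.savefig("throughput_vs_latency.png")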
Any updates? @ganeshku1