
I am not able to specify the request rate along with the concurrency range

Open KhaledButainy opened this issue 1 year ago • 4 comments

I am not sure if this is a bug or a feature request, but I wanted your support anyway.

I am trying to run perf_analyzer for a concurrency range 1:4:1 and request rate of 20 requests per second.

My understanding is that the concurrency range simulates the number of concurrent requests that arrive at the Triton server, while the request rate simulates how fast they arrive.

In other words, I am trying to run the following command:

```shell
perf_analyzer -m <model_name> --shape <input_layer>:<input_shape> --percentile=95 --input-data <input_data_directory> --concurrency-range 1:4:1 --request-rate-range 20 --verbose-csv --collect-metrics 1 -f client_test.csv
```

But it fails unless I remove either the --concurrency-range or the --request-rate-range option.

Your help would be very much appreciated.

Also, if you have any other workarounds, just let me know. I am open to other solutions.

Thanks.

KhaledButainy avatar Jan 31 '24 23:01 KhaledButainy

Hi @KhaledButainy, thanks for raising this issue.

@matthewkotila could you comment on the valid combinations of parameters for PA here?

rmccorm4 avatar Feb 02 '24 03:02 rmccorm4

You can't specify both concurrency and request rate because they affect each other. Here's a simple counterexample:

Imagine you specify a concurrency of 2 and a request rate of 4 requests per second, and imagine the server/model takes 1 second to complete an individual inference. Within the first second, if Perf Analyzer tries to send 4 requests to meet the request rate specification, the concurrency would grow beyond the specified 2 (it would reach 4). If Perf Analyzer instead tries to maintain a concurrency of 2, it cannot meet the specified request rate of 4 requests per second: because each inference takes 1 second, only 2 requests can be outstanding (and thus sent) during the first second while the concurrency cap is respected.
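The conflict can be sketched with a toy simulation. This is not part of Perf Analyzer; the 1 s service time, 4 req/s target rate, and concurrency caps are just the numbers from the example above:

```python
def requests_sent_in_window(rate, cap, service_time, window=1.0):
    """Count requests dispatched at a fixed rate under a concurrency cap."""
    interval = 1.0 / rate       # time between scheduled sends
    in_flight = []              # completion times of outstanding requests
    sent = 0
    t = 0.0
    while t < window:
        in_flight = [c for c in in_flight if c > t]   # retire finished work
        if len(in_flight) < cap:
            in_flight.append(t + service_time)        # dispatch a request
            sent += 1
        # if the cap is already reached, this send slot is skipped,
        # so the requested rate is not met
        t += interval
    return sent

# Uncapped (cap = 4): all 4 sends happen, but concurrency grows to 4.
# Capped at 2: only 2 requests go out, so the 4 req/s target is missed.
print(requests_sent_in_window(rate=4, cap=4, service_time=1.0))  # 4
print(requests_sent_in_window(rate=4, cap=2, service_time=1.0))  # 2
```

Either the rate constraint or the concurrency constraint has to give, which is why the tool accepts only one of the two.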

matthewkotila avatar Feb 02 '24 16:02 matthewkotila

Thank you @matthewkotila and @rmccorm4 .

I am actually trying to simulate a connection between one server and multiple clients sending requests at a specific rate at the same time.

For example, assume I know that my server completes 30 inference requests per second. Then I can either have one client sending 30 requests/second, or 2 clients sending at 15 requests/second, or 3 clients sending at 10 requests/second, and so on.

On the other hand, if I don't know my server's capacity, then I need to fix the request rate for each client and keep increasing the concurrency (which represents the number of clients) to measure how many concurrent sessions my server can sustain, given that each one sends at a fixed rate.
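That search can be sketched numerically. The 30 req/s capacity and 10 req/s per-client rate below are the illustrative figures from this thread, not measured values:

```python
SERVER_CAPACITY_RPS = 30   # assumed: server completes 30 inferences/second
PER_CLIENT_RATE = 10       # assumed: each simulated client sends 10 req/s

def max_clients(capacity_rps, per_client_rate):
    """Largest client count whose aggregate rate stays within capacity."""
    return capacity_rps // per_client_rate

# Sweep the client count past the capacity point to see where it tips over.
for clients in range(1, max_clients(SERVER_CAPACITY_RPS, PER_CLIENT_RATE) + 2):
    offered = clients * PER_CLIENT_RATE
    status = "within" if offered <= SERVER_CAPACITY_RPS else "over"
    print(f"{clients} clients -> {offered} req/s offered ({status} capacity)")
```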

Is there any way to simulate this scenario? Your example demonstrates the extreme case where the number of incoming requests is higher than what the model can handle, so latency keeps increasing without any increase in throughput.

KhaledButainy avatar Feb 02 '24 20:02 KhaledButainy

Concurrency is the number of outstanding requests to maintain. If you are looking for the performance associated with a certain request rate, you can run with a variety of request rates and measure the performance at each.

I do not believe there is a good way to simulate running multiple clients via perf analyzer (@matthewkotila can correct me if that has changed), so the numbers shown could be a bit different due to only needing one connection versus multiple. However, you will get the numbers you are looking for with the above approach.

dyastremsky avatar Feb 20 '24 20:02 dyastremsky