onnxruntime_backend

Request statistics reported incorrectly

vkatms opened this issue 2 years ago

Description
Incorrect values are reported as request durations when multiple requests are batched together.

Triton Information
What version of Triton are you using? 22.12

Are you using the Triton container or did you build it yourself? I'm using the Triton in-process API by linking against tritonserver.so, which I built myself.

To Reproduce
Get a model with a batch dimension, enable dynamic batching, and run a few requests concurrently. Then call the TRITONSERVER_ServerModelStatistics API, serialize its output to JSON as is done, for example, here, and examine the JSON. Please refer to this page for the description of the statistics data.
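
For reference, here is a minimal sketch of how I pull these statistics through the in-process API (assuming `server` is an already-created TRITONSERVER_Server* with the model loaded; error handling mostly trimmed):

```cpp
#include <cstddef>
#include <iostream>
#include <string>

#include "tritonserver.h"  // Triton in-process C API

// Fetch and print the statistics JSON for one model.
// Assumes `server` was already created and the model ("decoder" here) is loaded.
void PrintModelStats(TRITONSERVER_Server* server)
{
  TRITONSERVER_Message* stats_message = nullptr;
  // Version -1 requests statistics for all versions of the model.
  TRITONSERVER_Error* err = TRITONSERVER_ServerModelStatistics(
      server, "decoder", -1 /* model_version */, &stats_message);
  if (err != nullptr) {
    std::cerr << TRITONSERVER_ErrorMessage(err) << std::endl;
    TRITONSERVER_ErrorDelete(err);
    return;
  }

  // Serialize the stats message to a JSON string and print it.
  const char* json_base = nullptr;
  size_t json_size = 0;
  TRITONSERVER_MessageSerializeToJson(stats_message, &json_base, &json_size);
  std::cout << std::string(json_base, json_size) << std::endl;

  TRITONSERVER_MessageDelete(stats_message);
}
```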

Below is the output I get from my model. Sorry it's long; I hope it is enough to illustrate the issue without a simpler repro, but if one is needed please let me know. The JSON has two sections: stats for all inference requests under the inference_stats element, and per-batch-size stats under the batch_stats element.

Let's look at the duration metrics such as compute_input, compute_infer, and compute_output, which are reported in both sections. I expected the sum of those numbers across all batch sizes to equal the number reported as the total. That is, for the example below, which has batch sizes 1, 2, 3, 4, 5, 6, and 8, I expected compute_infer(batch_size=1) + compute_infer(batch_size=2) + ... + compute_infer(batch_size=8) to equal the compute_infer total under the inference_stats element. But in this example that's not the case:

281845006 + 28360472 + 3312169 + 442083 + 439883 + 2072518 + 462581 = 316934712 != 368605960.

But if we multiply the batch durations by the batch sizes, then the math works out: 1*281845006 + 2*28360472 + 3*3312169 + 4*442083 + 5*439883 + 6*2072518 + 8*462581 == 368605960.
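
To double-check, a tiny standalone program (not Triton code, just the numbers from the batch_stats section below) confirms that the per-batch compute_infer durations only add up to the reported total when each one is weighted by its batch size:

```cpp
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

int main()
{
  // {batch_size, compute_infer ns} pairs taken from the batch_stats section below.
  std::vector<std::pair<int64_t, int64_t>> batches = {
      {1, 281845006}, {2, 28360472}, {3, 3312169}, {4, 442083},
      {5, 439883},    {6, 2072518},  {8, 462581}};

  int64_t plain_sum = 0, weighted_sum = 0;
  for (const auto& [bs, ns] : batches) {
    plain_sum += ns;
    weighted_sum += bs * ns;
  }

  std::cout << plain_sum << "\n";     // 316934712 -- does not match the total
  std::cout << weighted_sum << "\n";  // 368605960 -- matches the inference_stats total
}
```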

I looked into this, and it seems the problem is with the code here. When the backend reports these statistics, it reports the same duration for each request in the batch as for the batch itself. So, for example, if there are 2 requests in a batch and the batch completes in 10ms, the backend reports a 10ms duration for each of the two requests and for the batch. These per-request durations are then aggregated (e.g. here) and become 20ms reported as the total, even though both requests were actually processed in the 10ms it took to process the batch.
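
To illustrate with the toy numbers from above (this is a simplified model of the aggregation, not the actual backend code): with one batch of 2 requests that runs in 10ms, reporting the full batch window once per request yields a 20ms total:

```cpp
#include <cstdint>
#include <iostream>

int main()
{
  // One batch of 2 requests, executed in 10 ms (toy numbers).
  const int64_t batch_infer_ns = 10'000'000;
  const int batch_size = 2;

  // What the backend does today: report the whole batch duration once per request,
  // so the per-request aggregation counts the same window batch_size times.
  int64_t inference_stats_infer_ns = 0;
  for (int r = 0; r < batch_size; ++r) {
    inference_stats_infer_ns += batch_infer_ns;
  }

  // batch_stats for batch_size=2 gets the duration only once.
  int64_t batch_stats_infer_ns = batch_infer_ns;

  std::cout << inference_stats_infer_ns << "\n";  // 20000000 -- overstated total
  std::cout << batch_stats_infer_ns << "\n";      // 10000000 -- actual wall time
}
```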

As a result, the inference_stats numbers overstate the durations and are effectively useless when dynamic batching is used. But getting these numbers right is very important for performance optimization, because tools like perf_analyzer rely on them.

I guess the best solution is to change this line to adjust the end times so that each request's duration is a 1/batch_size fraction of the total, though the rounding errors that introduces would need to be accounted for as well. A sketch of the idea is below.
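
Something along these lines (just a sketch of the idea, not a patch; SplitBatchDuration is a name I made up) would keep the per-request durations summing exactly to the batch duration despite integer division:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: split a batch duration into per-request durations whose sum
// equals the original, distributing the rounding remainder one ns at a time.
std::vector<uint64_t>
SplitBatchDuration(uint64_t batch_ns, size_t batch_size)
{
  std::vector<uint64_t> per_request(batch_size, batch_ns / batch_size);
  uint64_t remainder = batch_ns % batch_size;
  for (size_t i = 0; i < remainder; ++i) {
    per_request[i] += 1;  // the first `remainder` requests get one extra ns
  }
  return per_request;
}
```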

{ "model_stats": [ { "name": "decoder", "version": "1", "last_inference": 1672110878962, "inference_count": 437, "execution_count": 349, "inference_stats": { "success": { "count": 437, "ns": 2098376803 }, "fail": { "count": 0, "ns": 0 }, "queue": { "count": 437, "ns": 1674632856 }, "compute_input": { "count": 437, "ns": 35891659 }, "compute_infer": { "count": 437, "ns": 368605960 }, "compute_output": { "count": 437, "ns": 9855404 }, "cache_hit": { "count": 0, "ns": 0 }, "cache_miss": { "count": 0, "ns": 0 } }, "batch_stats": [ { "batch_size": 1, "compute_input": { "count": 293, "ns": 25860476 }, "compute_infer": { "count": 293, "ns": 281845006 }, "compute_output": { "count": 293, "ns": 5354188 } }, { "batch_size": 2, "compute_input": { "count": 44, "ns": 2814888 }, "compute_infer": { "count": 44, "ns": 28360472 }, "compute_output": { "count": 44, "ns": 1085655 } }, { "batch_size": 3, "compute_input": { "count": 5, "ns": 405684 }, "compute_infer": { "count": 5, "ns": 3312169 }, "compute_output": { "count": 5, "ns": 153794 } }, { "batch_size": 4, "compute_input": { "count": 1, "ns": 60297 }, "compute_infer": { "count": 1, "ns": 442083 }, "compute_output": { "count": 1, "ns": 31598 } }, { "batch_size": 5, "compute_input": { "count": 1, "ns": 56697 }, "compute_infer": { "count": 1, "ns": 439883 }, "compute_output": { "count": 1, "ns": 36998 } }, { "batch_size": 6, "compute_input": { "count": 4, "ns": 294887 }, "compute_infer": { "count": 4, "ns": 2072518 }, "compute_output": { "count": 4, "ns": 192193 } }, { "batch_size": 8, "compute_input": { "count": 1, "ns": 111295 }, "compute_infer": { "count": 1, "ns": 462581 }, "compute_output": { "count": 1, "ns": 50498 } } ] } ] }

Expected behavior
Aggregated request compute durations should be reported correctly and be consistent with the durations reported for the individual batch sizes.

vkatms, Dec 28 '22