dali_backend
Batching does not improve performance with dali
Issue
Batching does not improve performance with dali.
Description
In summary, inference slows down as we increase the batch size in our application.
We have an application that sends data to Triton for inference. As mentioned above, batching does not seem to improve performance with DALI. We are using an ensemble model that runs DALI for preprocessing and then does object detection with YOLO. Specifically, batch size 8 is significantly slower than batch size 1, and we only see the slowdown in the DALI portion of the pipeline; the object-detection portion is fine.
Running perf-analyzer with batch sizes 1 and 8 at concurrency 2 showed improved inferences/sec, as one might expect. However, we have not observed this in the application. Manual timing of the application shows that DALI takes up the majority of the inference time (object detection seems to be fine).
It is worth mentioning that we are testing by sending batches from the application as well as using dynamic batching in Triton, as can be seen below in the configs.
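To make the client-side batching concrete, here is a minimal sketch of what sending one batch of 8 frames to the ensemble could look like with tritonclient.grpc (illustrative only; the URL, variable names, and requested outputs are assumptions, not our exact application code):

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Stack 8 HWC uint8 frames into a single [8, 1080, 1920, 3] batch.
frames = [np.zeros((1080, 1920, 3), dtype=np.uint8) for _ in range(8)]
batch = np.stack(frames)

infer_input = grpcclient.InferInput("frame", list(batch.shape), "UINT8")
infer_input.set_data_from_numpy(batch)

outputs = [
    grpcclient.InferRequestedOutput("yolo_num_detections"),
    grpcclient.InferRequestedOutput("yolo_detection_boxes"),
    grpcclient.InferRequestedOutput("yolo_detection_scores"),
    grpcclient.InferRequestedOutput("yolo_detection_classes"),
]

# A single request carries the whole batch of 8 through the ensemble.
result = client.infer(model_name="ensemble", inputs=[infer_input], outputs=outputs)
boxes = result.as_numpy("yolo_detection_boxes")  # shape (8, 100, 4)
```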
Perf Analyzer/Application Infer Timing
We ran our application and recorded the median inference time in milliseconds for preprocessing and object detection at batch sizes 1 and 8. We also ran perf-analyzer for batch sizes 1 and 8 against Triton using the configs provided below.
At batch size 1, the object-detection avg request latency is 4.623 ms, while timing the object-detection inference in our application code gives a median of 9.017 ms. For batch size 1, perf-analyzer at concurrency 2 reports a throughput of 91.9781 infer/sec.
At batch size 8, the object-detection avg request latency is 10.749 ms, while timing the object-detection inference in our application code gives a median of 86.335 ms. For batch size 8, perf-analyzer at concurrency 2 reports a throughput of 170.247 infer/sec.
Additional information from perf-analyzer has been attached as a csv.
Config Information
Here is the configuration information:
ensemble - config.pbtxt
name: "ensemble"
platform: "ensemble"
max_batch_size: 8
input [
{
name: "frame"
data_type: TYPE_UINT8
dims: [ 1080, 1920, 3 ]
}
]
output [
{
name: "yolo_num_detections"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "yolo_detection_boxes"
data_type: TYPE_FP32
dims: [ 100, 4 ]
},
{
name: "yolo_detection_scores"
data_type: TYPE_FP32
dims: [ 100 ]
},
{
name: "yolo_detection_classes"
data_type: TYPE_INT32
dims: [ 100 ]
}
]
ensemble_scheduling {
step [
{
model_name: "preprocessing"
model_version: -1
input_map {
key: "raw_images"
value: "frame"
}
output_map [
{
key: "yolo_prep_output"
value: "yolo_preprocessed_image"
}
]
},
{
model_name: "object_detection"
model_version: -1
input_map [
{
key: "images"
value: "yolo_preprocessed_image"
}
]
output_map [
{
key: "num_detections"
value: "yolo_num_detections"
},
{
key: "detection_boxes"
value: "yolo_detection_boxes"
},
{
key: "detection_scores"
value: "yolo_detection_scores"
},
{
key: "detection_classes"
value: "yolo_detection_classes"
}
]
}
]
}
object detection - config.pbtxt
name: "object_detection"
platform: "tensorrt_plan"
max_batch_size: 8
input [
{
name: "images"
data_type: TYPE_FP32
dims: [ 3, 384, 640 ]
}
]
output [
{
name: "num_detections"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "detection_boxes"
data_type: TYPE_FP32
dims: [ 100, 4 ]
},
{
name: "detection_scores"
data_type: TYPE_FP32
dims: [ 100 ]
},
{
name: "detection_classes"
data_type: TYPE_INT32
dims: [ 100 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
dynamic_batching {
preferred_batch_size: [ 8 ]
}
pre-processing - config.pbtxt
name: "preprocessing"
backend: "dali"
max_batch_size: 8
input [
{
name: "raw_images"
data_type: TYPE_UINT8
dims: [ 1080, 1920, 3 ]
}
]
output [
{
name: "yolo_prep_output"
data_type: TYPE_FP32
dims: [ 3, 384, 640 ]
}
]
dynamic_batching {
preferred_batch_size: [ 8 ]
}
dali.py
import nvidia.dali as dali
import nvidia.dali.plugin.triton as triton


@triton.autoserialize
@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def pipe():
    images = dali.fn.external_source(device="gpu", name="raw_images")
    images = dali.fn.color_space_conversion(
        images, image_type=dali.types.BGR, output_type=dali.types.RGB
    )
    # YOLO PRE-PROCESSING
    images = dali.fn.resize(images, resize_x=640, resize_y=360)
    # 12 rows of padding (value 114) above and below the 360x640 image: 360 + 2 * 12 = 384
    pad = dali.types.Constant(
        value=114,
        dtype=dali.types.DALIDataType.UINT8,
        shape=[12, 640, 3],
        layout="HWC",
        device="gpu"
    )
    yolo_images = dali.fn.cat(pad, images, pad, axis=0)
    yolo_images = dali.fn.transpose(yolo_images, perm=[2, 0, 1])
    # normalize
    yolo_images = yolo_images / 255
    return yolo_images
Questions
- Why does DALI slow down when we introduce batching (sending data to Triton in batches of 8) with our configurations, and why do the results not match our perf-analyzer results?
- What else can we try in our configurations to improve performance, and what are some clues for finding the potential bottleneck?
Perf-Analyzer CSV Output
ensemble-concur2-ceiling8-batch8.csv ensemble-concur2-ceiling8-batch1.csv
Hello @hly0025, Just looking at the numbers from perf_analyzer, increasing the batch size does provide a performance improvement, right? The throughput roughly doubles while the latency also only doubles (despite the batch being 8 times bigger).
Can you tell us how the time measurement in your application is done, and do you have any suspicion why it could yield such different perf results?
Hello @banasraf
Time Measurements
Thanks for your reply. Here is how the measurement was done. I timed preprocessing and object detection individually in our application code (aka the client side of things):
start_timer = time.perf_counter()
batch_results = self.triton_client.infer(
    self.triton_config.model_name,
    inputs=inputs,
    outputs=outputs,
    client_timeout=None,
    compression_algorithm=None,
)
end = time.perf_counter()
self.write_to_csv("trt_infer_objectdetection", end - start_timer, tic_filename)
This writes to the file each time we call this function. I then took all the results in the CSV file and computed the median time for preprocessing and object detection when sending a batch of size 1 or size 8 from our application (client side).
This is how I get the median of 9.017 ms for inference with batch size 1 for object detection.
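For reference, a minimal sketch of roughly how such a CSV could be aggregated (the file name and column layout here are assumptions, not our exact script):

```python
import csv
import statistics

# Assumed layout: each row is "<label>,<elapsed_seconds>" as written by write_to_csv.
timings_ms = []
with open("tic_timings.csv") as f:  # hypothetical file name
    for label, elapsed in csv.reader(f):
        if label == "trt_infer_objectdetection":
            timings_ms.append(float(elapsed) * 1000.0)

print("observations:", len(timings_ms))
print("median [ms]:", statistics.median(timings_ms))
print("mean [ms]:  ", statistics.mean(timings_ms))
```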
Additional Information
I realize that above I only gave numbers for object detection. Here are some brief stats for the preprocessing:
Batch 1 for the pre-processor: avg request latency is 2.374 ms, and batch 1 timing of pre-processor inference in our application code is a median of 21.94 ms.
Batch 8 for the pre-processor: avg request latency is 10.749 ms, and batch 8 timing of pre-processor inference in our application code is a median of 238.028 ms.
We are using Triton tritonserver:22.08-py3 and the DALI version supported by it, which is 1.16.0.
Summary
For me, the biggest mystery is the perf-analyzer result of 10.749 ms at batch size 8 versus the application (client side) sending a batch of size 8 and being much slower at 238.02 ms. I would have thought that with batching it would be quicker.
@hly0025 ,
Thank you for the thorough analysis. I got somewhat confused by all the numbers you provided, so I've put them in a table. Could you please verify whether all these numbers and descriptions are correct according to your data?
Numbers
| batch_size | Application end-to-end Mean [ms] | Application end-to-end Median [ms] | Application preprocessing Mean [ms] | Application preprocessing Median [ms] | perf_analyzer Throughput [inf/s] | perf_analyzer p50 Latency [ms] |
|---|---|---|---|---|---|---|
| 1 | 4.62 | 9.02 | 2.37 | 21.9 | 92 | 22 |
| 8 | 10.7 | 86.3 | 10.7 | 238 | 170 | 86.7 |
I'm especially confused about the 238 ms and 9.02 ms measurements. It appears that, apart from these two, everything else makes perfect sense, and I'll explain it in the next paragraphs. For now I'll skip the 238 ms and 9.02 ms, but please provide more details about them.
Analysis
First of all, let's note that the Application and perf_analyzer results are consistent with each other:
- For batch_size=1, Application preprocessing median=21.9 while perf_analyzer median=22,
- For batch_size=8, Application preprocessing median=86.3 while perf_analyzer median=86.7.
Secondly, we should also note that the Application measurements are latency measurements. While perf_analyzer also provides throughput, the code snippet you've provided does not measure throughput in the Application.
So the remaining question is: why is there no perf improvement when using batch_size=8 with DALI? Actually, there is a 2x improvement! Assuming we are inferencing with batch_size=1, the median latency of a single sample is just what the number says, i.e. 22 ms. However, with batch_size=8, the approximate median latency of a single sample is the measured value divided by the batch size, so it's about latency/batch_size = 86/8 = 10.8 ms. As we can see, this number is about 2 times smaller than the single-sample latency when using batch_size=1.
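The same per-sample arithmetic, written out as a quick sketch (medians taken from the table above):

```python
# Approximate per-sample latency = batch latency / batch size.
median_latency_ms = {1: 22.0, 8: 86.0}

for batch_size, latency in median_latency_ms.items():
    print(f"batch_size={batch_size}: ~{latency / batch_size:.1f} ms per sample")

# batch_size=1: ~22.0 ms per sample
# batch_size=8: ~10.8 ms per sample
```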
Would this analysis be reasonable with regard to your environment and requirements? Please let me know if you have any questions. Also, in case something in my analysis looks incorrect, it would be great if you could clarify the two measurements I mentioned above.
@szalpal
Thanks for your reply and thorough remarks. I appreciate your patience in combing through my explanation to understand the issue. I agree it is a good idea to put the numbers into a table. Here is, as requested, my review of the numbers. Your analysis is very close, but an important distinction is that on the application side I timed the inference for object detection and pre-processing separately.
For now, I summarize the data again, and I hope this better conveys the information:
Timing Numbers
Perf-analyzer Results
Here are the results obtained using perf-analyzer with the Triton configs I shared above:
batch_size | Throughput inf/sec | p50 latency [ms] | p95 latency [ms] | object-detection-latency [ms] | preprocessing-latency [ms] |
---|---|---|---|---|---|
1 | 91.9781 | 21.98 | 29.972 | 4.623 | 2.374 |
8 | 170.247 | 86.67 | 130.034 | 7.179 | 10.749 |
Application Results
Here are the results obtained using the timing code via triton.infer. As stated above, I timed the object detection and pre-processing separately.
Batch Size | Object Detection Median Avg [ms] | Preprocessing Median Avg [ms] |
---|---|---|
1 | 9.017 | 21.94 |
8 | 86.335 | 238.028 |
Summary
The real confusion for me is that the application does not perform as I would expect at batch size 8. I hope separating the results obtained via perf-analyzer from those obtained from the application makes things clearer. In my mind, the pre-processing is taking longer than it should and I am not sure why.
To briefly recap, the application is guaranteed to send a batch of size 8 from the client side to Triton. Always. Given the configs, my understanding is that Triton should process this batch of 8 as one batch and send the results back. However, at least where pre-processing is concerned, it seems to be having an issue.
Thanks kindly again for your thorough remarks and response.
PS - I can time the ensemble (doing it all at once) if you like. However, my hope is that digging into the pre-processor and object detection separately helps with the diagnostics, so to speak.
@hly0025 ,
Thank you for clarifying the numbers. To be frank, I rather trust the perf_analyzer measurements, and they actually look promising (2 ms for batch_size=1 and 10 ms for batch_size=8 is a nice gain).
Could we take some time to verify whether the Application measurements are reliable? I mean, perf_analyzer by default runs multiple iterations until the time measurements are stable enough. Could you tell how many inference iterations you ran when taking these measurements? Also, it is natural that the first few iterations are slower because of the memory allocations that happen underneath. Are you doing a warmup before running the performance test? Could you also provide a bit more statistics? You've measured the median; is it possible to measure the average and standard deviation? The more data you provide, the higher the chance we have of finding the root cause of the discrepancy between the results.
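For illustration, a minimal sketch of the kind of warmup-plus-measurement loop I mean (run_inference here is a placeholder for the client-side triton_client.infer(...) call, not actual code from either side):

```python
import statistics
import time

def benchmark(run_inference, warmup_iters=50, timed_iters=500):
    # Warmup: let first-iteration costs (memory allocation, etc.) settle before timing.
    for _ in range(warmup_iters):
        run_inference()

    samples_ms = []
    for _ in range(timed_iters):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms),
    }
```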
@szalpal
Thank you for your reply; that makes sense.
Brief Recap
On the application side (client side), we set the batch size to 8. This ensures that we are sending data in batches of 8 to triton_client.infer. For iterations, I can count how many observations we have in our CSV timing file if you feel that would be useful. I'm not sure what you mean by a "warmup", but to address that issue, and also your excellent point about outliers, we timed it for approximately 15 minutes:
start_timer = time.perf_counter()
batch_results = self.triton_client.infer(
    self.triton_config.model_name,
    inputs=inputs,
    outputs=outputs,
    client_timeout=None,
    compression_algorithm=None,
)
end = time.perf_counter()
self.write_to_csv("trt_infer_objectdetection", end - start_timer, tic_filename)
For statistics, please see the min, max, median, IQR, and the 25th and 75th percentiles below.
Method | Min | Max | Median | IQR | 25th Percentile | 75th Percentile |
---|---|---|---|---|---|---|
Object Detection | 58.729ms | 175.925ms | 86.335ms | 22.646ms | 73.186ms | 95.831ms |
Preprocessing | 159.745ms | 897.239ms | 238.028ms | 69.040ms | 214.081ms | 283.122ms |
Summary
I can provide the standard deviation and average if desired, but I hope the min, max, median, and IQR address the core of what you need to assess the application side of things more clearly. Admittedly, if you will pardon the colloquial English expression, this is somewhat comparing apples (application timing) to oranges (perf-analyzer). Nevertheless, I believe the preprocessing is still slower than I would anticipate based on the perf-analyzer results.
Thank you for your remarks and questions.
A gentle inquiry, @szalpal: is there a status update or any further thoughts on this? Thanks kindly in advance!
I also use DALI for my model preprocessing. No matter how the parameters are adjusted, the DALI throughput does not improve and only reaches a maximum of about 750 infer/sec. However, if I use nvJPEGDecMultipleInstances for decoding, the decoding throughput reaches about 2100 images/sec.
I am using the COCO/val2017 dataset and running it on an A10.
dali pipe
import nvidia.dali as dali
import nvidia.dali.types as types


def preprocessing(images, device='gpu'):
    # device="mixed" runs GPU-accelerated (nvJPEG) image decoding.
    images_ori = dali.fn.decoders.image(
        images, device="mixed", output_type=types.BGR)
    return images_ori


@dali.pipeline_def(batch_size=32, num_threads=32, device_id=0)
def pipe():
    images = dali.fn.external_source(
        device="cpu", name="encoded", no_copy=True)
    return preprocessing(images)
dali config.pbtxt
name: "dali_preprocess_yolo"
backend: "dali"
max_batch_size: 32
input [
{
name: "encoded"
data_type: TYPE_UINT8
dims: [ -1 ]
allow_ragged_batch: true
}
]
output [
{
name: "original"
data_type: TYPE_UINT8
dims: [ -1, -1, 3]
}
]
dynamic_batching {
preferred_batch_size: [32]
max_queue_delay_microseconds: 100000
}
parameters: [
{
key: "num_threads"
value: { string_value: "32" }
}
]
instance_group [
{
count: 4
kind: KIND_GPU
gpus: [ 0 ]
}
]
perf_analyzer parameters
perf_analyzer -i grpc -u $HTTP_ADDR -p$TIME_WINDOW -m bls_async_pre1 --input-data dataset.json --concurrency-range=64:192:64
result
*** Measurement Settings ***
Batch size: 1
Service Kind: Triton
Using "time_windows" mode for stabilization
Measurement window: 10000 msec
Latency limit: 0 msec
Concurrency limit: 192 concurrent requests
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 64
Client:
Request count: 27480
Throughput: 755.607 infer/sec
Avg latency: 84710 usec (standard deviation 2205 usec)
p50 latency: 30150 usec
p90 latency: 40957 usec
p95 latency: 589347 usec
p99 latency: 624066 usec
Avg gRPC time: 84695 usec ((un)marshal request/response 30 usec + response wait 84665 usec)
Server:
Inference count: 27489
Execution count: 866
Successful request count: 27489
Avg request latency: 103727 usec (overhead 24 usec + queue 8277 usec + compute input 97 usec + compute infer 8180 usec + compute output 87148 usec)
Request concurrency: 128
Client:
Request count: 26264
Throughput: 721.068 infer/sec
Avg latency: 174870 usec (standard deviation 21386 usec)
p50 latency: 51439 usec
p90 latency: 677091 usec
p95 latency: 698442 usec
p99 latency: 783479 usec
Avg gRPC time: 174866 usec ((un)marshal request/response 50 usec + response wait 174816 usec)
Server:
Inference count: 26163
Execution count: 825
Successful request count: 26163
Avg request latency: 198526 usec (overhead 29 usec + queue 30932 usec + compute input 123 usec + compute infer 13009 usec + compute output 154433 usec)
Request concurrency: 192
Client:
Request count: 27668
Throughput: 756.204 infer/sec
Avg latency: 252569 usec (standard deviation 19340 usec)
p50 latency: 78878 usec
p90 latency: 701562 usec
p95 latency: 720703 usec
p99 latency: 1253034 usec
Avg gRPC time: 252548 usec ((un)marshal request/response 52 usec + response wait 252496 usec)
Server:
Inference count: 27680
Execution count: 865
Successful request count: 27680
Avg request latency: 280895 usec (overhead 29 usec + queue 117216 usec + compute input 123 usec + compute infer 13554 usec + compute output 149972 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 755.607 infer/sec, latency 84710 usec
Concurrency: 128, throughput: 721.068 infer/sec, latency 174870 usec
Concurrency: 192, throughput: 756.204 infer/sec, latency 252569 usec
nvJPEGDecMultipleInstances parameters
./nvJPEGDecMultipleInstances -i /mnt/share/2/dataset/coco/images/val2017/ -j 16 -b 16 -batched -t 10000 -w 100 -fmt unchanged
nvJPEGDecMultipleInstances result
Total decoding time: 4.69204 (s)
Avg decoding time per image: 0.000469204 (s)
Avg images per sec: 2131.27
Avg decoding time per batch: 0.00750727 (s)
params.num_threads: 16
params.batch_size: 16
@qihang720 ,
In the snippet you've provided (perf_analyzer parameters), I see you're benchmarking the bls_async_pre1 model, not the dali_preprocess_yolo model. Could you double-check whether the numbers you've provided are for the dali_preprocess_yolo model?
I'm so sorry for my delayed response.
I checked my command; previously I used bls_async_pre1 for the benchmark, which runs the DALI pipeline through the Python backend.
The result below uses the DALI backend. Using the DALI backend is faster than using the Python backend, but GPU utilization never reaches its maximum.
Successfully read data for 1 stream/streams with 5000 step/steps.
*** Measurement Settings ***
Batch size: 1
Service Kind: Triton
Using "time_windows" mode for stabilization
Measurement window: 100000 msec
Latency limit: 0 msec
Concurrency limit: 192 concurrent requests
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 64
Client:
Request count: 314744
Throughput: 873.201 infer/sec
Avg latency: 73211 usec (standard deviation 5031 usec)
p50 latency: 33205 usec
p90 latency: 53482 usec
p95 latency: 429073 usec
p99 latency: 545507 usec
Avg gRPC time: 73189 usec ((un)marshal request/response 41 usec + response wait 73148 usec)
Server:
Inference count: 314713
Execution count: 13860
Successful request count: 314713
Avg request latency: 83935 usec (overhead 27 usec + queue 5470 usec + compute input 111 usec + compute infer 10189 usec + compute output 68137 usec)
Request concurrency: 128
Client:
Request count: 320336
Throughput: 888.4 infer/sec
Avg latency: 144054 usec (standard deviation 4045 usec)
p50 latency: 53815 usec
p90 latency: 502949 usec
p95 latency: 530911 usec
p99 latency: 642329 usec
Avg gRPC time: 144028 usec ((un)marshal request/response 52 usec + response wait 143976 usec)
Server:
Inference count: 320296
Execution count: 11586
Successful request count: 320296
Avg request latency: 160229 usec (overhead 33 usec + queue 6215 usec + compute input 155 usec + compute infer 13166 usec + compute output 140659 usec)
Request concurrency: 192
Client:
Request count: 305806
Throughput: 848.006 infer/sec
Avg latency: 226562 usec (standard deviation 6402 usec)
p50 latency: 79282 usec
p90 latency: 619427 usec
p95 latency: 662947 usec
p99 latency: 760955 usec
Avg gRPC time: 226536 usec ((un)marshal request/response 60 usec + response wait 226476 usec)
Server:
Inference count: 305762
Execution count: 9670
Successful request count: 305762
Avg request latency: 251549 usec (overhead 38 usec + queue 17388 usec + compute input 199 usec + compute infer 16426 usec + compute output 217497 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 873.201 infer/sec, latency 73211 usec
Concurrency: 128, throughput: 888.4 infer/sec, latency 144054 usec
Concurrency: 192, throughput: 848.006 infer/sec, latency 226562 usec