
Performance of PP-Lite-Seg

Open MaxEAB opened this issue 1 year ago • 3 comments

Hi,

I have been running some experiments on the pre-trained PP_LITE_T_SEG50 model. I get an inference frame rate of 81 fps for 512x1024 resolution images on a GTX 1080. Why is this lower than the reported 273 fps in the paper? (They benchmarked this on 1080 Ti, 512x1024 images). Even accounting for the differences between 1080 and 1080 Ti, the fps looks low.

Is this due to a difference in implementation or something else? Could you check?

Thanks,

MaxEAB avatar May 15 '23 17:05 MaxEAB

Hi @MaxEAB , the first thing to make sure of is to use the TensorRT framework for a fair comparison with the paper's results.

When benchmarking with the trtexec tool there are two common conventions for defining the latency metric: with IO operations and without IO. IO latency refers to the time taken to copy input tensors from the CPU to the runtime [GPU] hardware, and to retrieve output tensors from the runtime hardware back to the CPU. These operations are model agnostic and depend solely on the input / output tensor sizes.
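For reference, a minimal trtexec invocation for this kind of benchmark might look like the following. The model file name and input tensor name here are illustrative placeholders, not taken from the original experiment:

```shell
# Hypothetical sketch: benchmark an exported ONNX model with trtexec.
# The file name and tensor name "input" are assumptions for illustration.
trtexec --onnx=pplite_t_seg50.onnx \
        --shapes=input:1x3x512x1024 \
        --fp16
```

trtexec then prints the performance summary with the `Latency`, `H2D Latency`, `GPU Compute Time`, and `D2H Latency` metrics discussed below.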

See below our results for pplite-t-seg50 [512x1024, wrapped with downsample / upsample as described in the paper], with TensorRT 8.5 on an RTX 2080 Ti. Unfortunately we don't have the exact same hardware as the authors, and the TRT 7 used in the paper is quite an old version.

For latency without IO, the GPU Compute Time should be considered, which yields 2.04ms / ~490FPS.

For latency with IO, the Latency metric should be considered, which yields 16.7ms / ~60FPS. Of that, roughly 14.7ms is the IO operations latency (H2D + D2H), which is a fixed cost for all Cityscapes models with the same input / output sizes.

The relation between those metrics is:

`Latency` = `GPU Compute Time` + `H2D Latency` + `D2H Latency`

See trtexec documentation.

Bottom line, I believe this issue is not a matter of which model is measured, but of how we measure and how latency is defined. The only way to compare fairly against the paper is to ask the original authors for those details.

[05/15/2023-17:21:20] [I] === Performance summary ===
[05/15/2023-17:21:20] [I] Throughput: 80.2231 qps
[05/15/2023-17:21:20] [I] Latency: min = 16.3857 ms, max = 16.7418 ms, mean = 16.702 ms, median = 16.7065 ms, percentile(90%) = 16.7227 ms, percentile(95%) = 16.7273 ms, percentile(99%) = 16.7334 ms
[05/15/2023-17:21:20] [I] Enqueue Time: min = 1.22504 ms, max = 1.35376 ms, mean = 1.2537 ms, median = 1.25342 ms, percentile(90%) = 1.27466 ms, percentile(95%) = 1.28467 ms, percentile(99%) = 1.29614 ms
[05/15/2023-17:21:20] [I] H2D Latency: min = 2.23816 ms, max = 2.27429 ms, mean = 2.25087 ms, median = 2.24829 ms, percentile(90%) = 2.2627 ms, percentile(95%) = 2.26318 ms, percentile(99%) = 2.27017 ms
[05/15/2023-17:21:20] [I] GPU Compute Time: min = 2.03003 ms, max = 2.05176 ms, mean = 2.03723 ms, median = 2.03699 ms, percentile(90%) = 2.04126 ms, percentile(95%) = 2.04321 ms, percentile(99%) = 2.04712 ms
[05/15/2023-17:21:20] [I] D2H Latency: min = 12.0984 ms, max = 12.4397 ms, mean = 12.4139 ms, median = 12.4164 ms, percentile(90%) = 12.4254 ms, percentile(95%) = 12.4324 ms, percentile(99%) = 12.4333 ms
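Plugging the mean values from the summary above into the relation confirms it, and reproduces both FPS figures (a quick sanity check, using only numbers from the log):

```python
# Mean values (ms) copied from the trtexec performance summary above.
gpu_compute = 2.03723  # GPU Compute Time
h2d = 2.25087          # Host-to-Device copy latency
d2h = 12.4139          # Device-to-Host copy latency

# Latency = GPU Compute Time + H2D Latency + D2H Latency
latency = gpu_compute + h2d + d2h
print(f"Latency: {latency:.3f} ms")                  # ~16.702 ms, matches the log
print(f"FPS without IO: {1000 / gpu_compute:.0f}")   # ~491 FPS
print(f"FPS with IO:    {1000 / latency:.0f}")       # ~60 FPS
```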

lkdci avatar May 15 '23 17:05 lkdci

Thanks for the detailed reply. Much appreciated.

MaxEAB avatar May 15 '23 18:05 MaxEAB