
Partition convolution utilities for parallel processing

Open yuslepukhin opened this issue 2 years ago • 7 comments

Description: This change partitions the Im2Col and Col2Im utilities for multi-threaded processing. Work is divided based on algorithm/data cost rather than strictly per channel: multiple channels are batched together for processing on a single thread when there is little data per channel. We ensure batches are no smaller than a certain cost threshold so we do not blindly divide the data among all available threads, only among those that are actually needed. Otherwise, the cost of dispatching threads adds latency variance rather than speedup.

Motivation and Context: This speeds up certain customer models.
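The cost-based batching described above can be sketched roughly as follows. This is a minimal illustration, not the actual PR code; `ComputeBatchCount` and all its parameters are hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: decide how many batches (threads) to use for
// channel-wise work. Channels are grouped so that no batch falls below a
// minimum cost threshold, which keeps us from fanning out tiny work items
// across all available threads.
size_t ComputeBatchCount(size_t channels, size_t cost_per_channel,
                         size_t min_cost_per_batch, size_t max_threads) {
  const size_t total_cost = channels * cost_per_channel;
  if (total_cost == 0) return 1;
  // At most one batch per min_cost_per_batch worth of total work.
  const size_t by_cost =
      std::max<size_t>(1, total_cost / min_cost_per_batch);
  // Never more batches than channels, and never more than the pool offers.
  return std::min({by_cost, channels, max_threads});
}
```

With cheap channels this collapses to a single batch (no thread dispatch at all); only when the total cost justifies it does the count grow toward the thread-pool size.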

yuslepukhin avatar Aug 04 '22 23:08 yuslepukhin

This pull request introduces 3 alerts and fixes 1 when merging dc9bef57393d344644043875d84ab6d339055488 into 37995a7245895c1116e67a90d72c4c568dac8282 - view on LGTM.com

new alerts:

  • 2 for Commented-out code
  • 1 for Unsigned comparison to zero

fixed alerts:

  • 1 for Commented-out code

lgtm-com[bot] avatar Aug 05 '22 01:08 lgtm-com[bot]

Changing so many operators all at once has too big a performance impact. I suggest modifying only one operator first. We also need a set of 1P production models and benchmark models, run performance tests, and make sure there is no perf degradation.

chenfucn avatar Aug 05 '22 05:08 chenfucn

im2col/col2im are almost always accompanied by GEMM. We should partition GEMM and im2col/col2im together. This way, we have one single parallel for in the operator instead of two back to back ones, reducing the cost of starting threads and waiting for all of them to finish.
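The fused structure proposed here could look roughly like the sketch below. This is an illustrative assumption, not ONNX Runtime code: `std::thread` stands in for the ORT thread pool, a 1x1 "convolution" keeps the arithmetic trivial, and `FusedConv` is a hypothetical name. The point is that each worker runs its im2col step and its GEMM step back to back, so the pool forks and joins exactly once instead of twice.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: one fused parallel loop over batch images, where
// each worker performs "im2col" and then "GEMM" for its own slice.
std::vector<float> FusedConv(const std::vector<std::vector<float>>& images,
                             float weight) {
  std::vector<float> out(images.size(), 0.0f);
  std::vector<std::thread> workers;
  for (size_t i = 0; i < images.size(); ++i) {
    workers.emplace_back([&, i] {
      // "im2col": for a 1x1 kernel with stride 1 this is the identity
      // layout, so the column buffer is just the image itself.
      const std::vector<float>& col = images[i];
      // "GEMM": multiply-accumulate the column buffer against the
      // single shared weight.
      float acc = 0.0f;
      for (float v : col) acc += v * weight;
      out[i] = acc;
    });
  }
  for (auto& t : workers) t.join();  // one join covers both stages
  return out;
}
```

With two separate parallel-fors (one for im2col, one for GEMM) there would be two fork/join points and two rounds of thread-dispatch overhead; fusing them halves that fixed cost.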

chenfucn avatar Aug 05 '22 05:08 chenfucn

> Changing so many operators all at once has too big of a performance impact. I suggest only modifying one operator first. And we need to have a set of 1p production models and benchmark models, run performance tests and make sure there is no perf degradation.

Performance impact has been verified: the customer model gains better performance, and our performance builds do not indicate any regressions.

yuslepukhin avatar Aug 05 '22 17:08 yuslepukhin

> Changing so many operators all at once has too big of a performance impact. I suggest only modifying one operator first. And we need to have a set of 1p production models and benchmark models, run performance tests and make sure there is no perf degradation.

> performance impact has been verified. The customer model gains better performance, while our performance builds do not indicate any regressions.

You mean arubis? I don't think they have 1P production models. Do we have performance tests on ARM processors? Can you link the perf comparison results in the description?

chenfucn avatar Aug 05 '22 18:08 chenfucn

This pull request introduces 1 alert and fixes 1 when merging 71a13f80dd5f407dd5c0723c609fcce1736ed24d into e85e31ee803555f7b2eec4f7b11a1d8645b2c115 - view on LGTM.com

new alerts:

  • 1 for Commented-out code

fixed alerts:

  • 1 for Commented-out code

lgtm-com[bot] avatar Aug 06 '22 00:08 lgtm-com[bot]

This pull request fixes 1 alert when merging b2420219eaadb78d7482da2ce44f3269824f55e8 into bdd6b00c9a6a5cdd2925762a9058e5c46f4a787f - view on LGTM.com

fixed alerts:

  • 1 for Commented-out code

lgtm-com[bot] avatar Aug 08 '22 19:08 lgtm-com[bot]