PaddleSeg
PP-HumanMatting has a non-optimal graph
While optimizing the PP-HumanMatting model, I found that the computational graph contains an odd, non-optimal pattern. Instead of a single pad2d op, it uses a combination of unsqueeze2 + pad3d + squeeze2 ops, which behaves like pad2d but significantly slows the model down. I have written oneDNN kernels for both pad3d and pad2d in PR #43990. That PR sped up the HumanMatting model by 30%, but achieving even better performance requires replacing the unsqueeze2 + pad3d + squeeze2 patterns with pad2d. Doing so would improve the model's performance under oneDNN by another 13% relative to the profiling results listed below.
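To illustrate why the three-op pattern is equivalent to pad2d, here is a minimal pure-Python sketch on nested lists. It is not Paddle's implementation: the helper names (`pad_plane`, `unsqueeze_depth`, `squeeze_depth`, `via_pad3d`) are hypothetical, only constant zero padding is modeled, and the padding order (left, right, top, bottom, front, back) is assumed to match pad3d.

```python
def pad_plane(plane, left, right, top, bottom, value=0.0):
    """Constant-pad one H x W plane given as a list of rows."""
    width = len(plane[0]) + left + right
    body = [[value] * left + list(row) + [value] * right for row in plane]
    above = [[value] * width for _ in range(top)]
    below = [[value] * width for _ in range(bottom)]
    return above + body + below


def pad2d(x, pads):
    """Pad H and W of an NCHW tensor; pads = (left, right, top, bottom)."""
    return [[pad_plane(plane, *pads) for plane in sample] for sample in x]


def unsqueeze_depth(x):
    """NCHW -> NCDHW with a depth axis of size 1 (the unsqueeze2 step)."""
    return [[[plane] for plane in sample] for sample in x]


def pad3d(x, pads):
    """Pad D, H and W of an NCDHW tensor.

    pads = (left, right, top, bottom, front, back).
    """
    left, right, top, bottom, front, back = pads
    out = []
    for sample in x:
        channels = []
        for volume in sample:
            planes = [pad_plane(p, left, right, top, bottom) for p in volume]
            height, width = len(planes[0]), len(planes[0][0])

            def blank():
                # A fresh all-zero plane used for depth padding.
                return [[0.0] * width for _ in range(height)]

            channels.append(
                [blank() for _ in range(front)]
                + planes
                + [blank() for _ in range(back)]
            )
        out.append(channels)
    return out


def squeeze_depth(x):
    """NCDHW with D == 1 -> NCHW (the squeeze2 step)."""
    return [[volume[0] for volume in sample] for sample in x]


def via_pad3d(x, pads):
    """The three-op pattern: with zero depth padding it is a plain 2D pad."""
    return squeeze_depth(pad3d(unsqueeze_depth(x), tuple(pads) + (0, 0)))


x = [[[[1.0, 2.0], [3.0, 4.0]]]]  # shape N=1, C=1, H=2, W=2
pads = (1, 2, 0, 1)               # left, right, top, bottom
assert pad2d(x, pads) == via_pad3d(x, pads)
```

Because the depth axis has size 1 and receives no padding, the unsqueeze and squeeze steps only add and remove a dummy axis, which is why a fusion pass could in principle collapse the three ops into one pad2d.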
Pattern spotted in humanmatting_model.zip: [graph screenshot]

Profiling results measured on an Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz after #43990:
------------------------- Overhead Summary -------------------------
Total time: 6985.58
Computation time Total: 6877.93 Ratio: 98.4589%
Framework overhead Total: 107.654 Ratio: 1.5411%
------------------------- Event Summary -------------------------
| Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|
| thread0::conv2d | 900 | 3625.98 | 0.061789 | 52.8846 | 4.02886 | 0.519066 |
| thread0::bilinear_interp_v2 | 200 | 746.636 | 0.152614 | 36.586 | 3.73318 | 0.106882 |
| thread0::Executor::Run | 1 | 665.823 | 665.823 | 665.823 | 665.823 | 0.0953138 |
| Executor::RunPartialPreparedContext | 1 | 665.703 | 665.703 | 665.703 | 665.703 | 0.0952967 |
| load_combine | 1 | 665.569 | 665.569 | 665.569 | 665.569 | 0.0952776 |
| thread0::concat | 170 | 551.391 | 0.019189 | 39.612 | 3.24347 | 0.0789327 |
| thread0::squeeze2 | 30 | 449.727 | 1.65939 | 109.896 | 14.9909 | 0.0643793 |
| thread0::unsqueeze2 | 30 | 444.685 | 1.78944 | 110.22 | 14.8228 | 0.0636576 |
| thread0::arg_max | 10 | 153.128 | 13.949 | 16.8544 | 15.3128 | 0.0219206 |
| thread0::pad3d | 30 | 143.108 | 0.613291 | 16.0822 | 4.77026 | 0.0204862 |
| thread0::nearest_interp_v2 | 10 | 91.2025 | 6.93993 | 22.8784 | 9.12025 | 0.0130558 |
| thread0::slice | 80 | 35.9588 | 0.009735 | 3.30152 | 0.449485 | 0.00514757 |
| thread0::relu | 10 | 20.5054 | 1.76941 | 2.98284 | 2.05054 | 0.00293538 |
| thread0::pool2d | 70 | 19.2899 | 0.070717 | 0.963652 | 0.27557 | 0.00276139 |
| thread0::softmax | 10 | 15.4916 | 1.3888 | 1.69225 | 1.54916 | 0.00221766 |
| thread0::equal | 20 | 6.84826 | 0.267666 | 0.417393 | 0.342413 | 0.000980343 |
| thread0::elementwise_add | 40 | 5.39096 | 0.066211 | 0.4031 | 0.134774 | 0.000771727 |
| thread0::cast | 20 | 5.02093 | 0.183324 | 0.665627 | 0.251047 | 0.000718757 |
| thread0::elementwise_mul | 10 | 1.52323 | 0.11234 | 0.188003 | 0.152323 | 0.000218053 |
| thread0::sigmoid | 10 | 1.40188 | 0.096007 | 0.208529 | 0.140188 | 0.000200681 |
| thread0::shape | 40 | 0.898588 | 0.010386 | 0.084814 | 0.0224647 | 0.000128635 |
| thread0::scale | 20 | 0.694454 | 0.00987 | 0.102248 | 0.0347227 | 9.94125e-05 |
| thread0::fill_constant | 40 | 0.637646 | 0.00751 | 0.045786 | 0.0159411 | 9.12803e-05 |
| thread0::elementwise_floordiv | 20 | 0.241182 | 0.007927 | 0.023915 | 0.0120591 | 3.45257e-05 |
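As a rough sanity check on the numbers above, the following sketch adds up the Total column for the three ops in the pattern and compares it with the total time from the Overhead Summary. The figures are copied from the profile; treating their sum as an upper bound on the possible saving is an assumption, since a replacement pad2d op would still cost some time.

```python
# Total times copied from the Event Summary table (profiler time units).
TOTAL_TIME = 6985.58
pattern_ops = {
    "squeeze2": 449.727,
    "unsqueeze2": 444.685,
    "pad3d": 143.108,
}

# Combined cost of the unsqueeze2 + pad3d + squeeze2 pattern.
pattern_time = sum(pattern_ops.values())

# Share of the whole profiled run spent inside the pattern.
pattern_share = pattern_time / TOTAL_TIME

print(f"pattern time: {pattern_time:.2f} ({pattern_share:.1%} of total)")
```

The pattern accounts for a bit under 15% of the profiled time, which is in line with the estimated 13% gain once the cost of the pad2d op itself is taken into account.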
Thank you for your suggestion.
Really looking forward to that performance boost.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.