PaddleSeg

PP-HumanMatting has non-optimal graph

Open jakpiase opened this issue 3 years ago • 2 comments

While optimizing the PP-HumanMatting model, I found that the computational graph contains an odd, non-optimal pattern. Instead of a single pad2d op, it uses a combination of unsqueeze2 + pad3d + squeeze2 ops, which behaves like pad2d but slows the model down significantly. I have written oneDNN kernels for both pad3d and pad2d in PR #43990. That PR sped up the HumanMatting model by 30%, but achieving even better performance requires replacing the unsqueeze2 + pad3d + squeeze2 patterns with pad2d. Doing so would improve the model's performance under oneDNN by another 13% compared to the current profiling listed below.
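To make the proposed rewrite concrete, here is a minimal sketch of the fusion on a toy, linear op list. It is not Paddle's pass API: the `Op` class, the attribute names (`axes`, `paddings`, `mode`) and the six-element padding layout with the depth entries last are all assumptions made for illustration, and the real attribute layouts of pad3d and pad2d would need to be checked before porting this.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Toy stand-in for a graph node; not a Paddle class."""
    type: str
    attrs: dict = field(default_factory=dict)


PATTERN = ["unsqueeze2", "pad3d", "squeeze2"]


def _is_2d_pad(unsq, pad, sq):
    # The rewrite is only safe if the axis added by unsqueeze2 is the one
    # removed by squeeze2 and that extra (depth) axis is never padded.
    same_axis = unsq.attrs.get("axes") == sq.attrs.get("axes")
    depth_untouched = pad.attrs.get("paddings", [])[4:] == [0, 0]
    return same_axis and depth_untouched


def fuse_pad_pattern(ops):
    """Collapse unsqueeze2 -> pad3d -> squeeze2 runs into one pad2d op."""
    fused = []
    i = 0
    while i < len(ops):
        window = ops[i:i + 3]
        if [op.type for op in window] == PATTERN and _is_2d_pad(*window):
            pad = window[1]
            fused.append(Op("pad2d", {
                # Assumed toy layout: the first four entries are the 2D pads.
                "paddings": pad.attrs["paddings"][:4],
                "mode": pad.attrs.get("mode", "constant"),
            }))
            i += 3  # skip the three ops that were just replaced
        else:
            fused.append(ops[i])
            i += 1
    return fused


graph = [
    Op("conv2d"),
    Op("unsqueeze2", {"axes": [2]}),
    Op("pad3d", {"paddings": [1, 1, 1, 1, 0, 0], "mode": "replicate"}),
    Op("squeeze2", {"axes": [2]}),
    Op("conv2d"),
]
optimized = fuse_pad_pattern(graph)
print([op.type for op in optimized])
```

The guard in `_is_2d_pad` is the important part: a pad3d that actually pads the inserted axis is not equivalent to pad2d and has to be left alone.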

Pattern spotted in humanmatting_model.zip: [image]

Profiling measured on Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz after #43990:

------------------------- Overhead Summary -------------------------

Total time: 6985.58
  Computation time       Total: 6877.93     Ratio: 98.4589%
  Framework overhead     Total: 107.654     Ratio: 1.5411%

------------------------- Event Summary -------------------------

Event                                Calls  Total     Min.      Max.      Ave.       Ratio.
thread0::conv2d                      900    3625.98   0.061789  52.8846   4.02886    0.519066
thread0::bilinear_interp_v2          200    746.636   0.152614  36.586    3.73318    0.106882
thread0::Executor::Run               1      665.823   665.823   665.823   665.823    0.0953138
Executor::RunPartialPreparedContext  1      665.703   665.703   665.703   665.703    0.0952967
load_combine                         1      665.569   665.569   665.569   665.569    0.0952776
thread0::concat                      170    551.391   0.019189  39.612    3.24347    0.0789327
thread0::squeeze2                    30     449.727   1.65939   109.896   14.9909    0.0643793
thread0::unsqueeze2                  30     444.685   1.78944   110.22    14.8228    0.0636576
thread0::arg_max                     10     153.128   13.949    16.8544   15.3128    0.0219206
thread0::pad3d                       30     143.108   0.613291  16.0822   4.77026    0.0204862
thread0::nearest_interp_v2           10     91.2025   6.93993   22.8784   9.12025    0.0130558
thread0::slice                       80     35.9588   0.009735  3.30152   0.449485   0.00514757
thread0::relu                        10     20.5054   1.76941   2.98284   2.05054    0.00293538
thread0::pool2d                      70     19.2899   0.070717  0.963652  0.27557    0.00276139
thread0::softmax                     10     15.4916   1.3888    1.69225   1.54916    0.00221766
thread0::equal                       20     6.84826   0.267666  0.417393  0.342413   0.000980343
thread0::elementwise_add             40     5.39096   0.066211  0.4031    0.134774   0.000771727
thread0::cast                        20     5.02093   0.183324  0.665627  0.251047   0.000718757
thread0::elementwise_mul             10     1.52323   0.11234   0.188003  0.152323   0.000218053
thread0::sigmoid                     10     1.40188   0.096007  0.208529  0.140188   0.000200681
thread0::shape                       40     0.898588  0.010386  0.084814  0.0224647  0.000128635
thread0::scale                       20     0.694454  0.00987   0.102248  0.0347227  9.94125e-05
thread0::fill_constant               40     0.637646  0.00751   0.045786  0.0159411  9.12803e-05
thread0::elementwise_floordiv        20     0.241182  0.007927  0.023915  0.0120591  3.45257e-05
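As a rough cross-check of the numbers above, the share of total time spent in the three ops that make up the pattern can be computed directly from the Event Summary. The values below are copied from the table. The result is only an upper bound on what a fused pad2d could save, since the replacement op still has a cost of its own.

```python
# Total times copied from the Event Summary above.
total_time = 6985.58
pattern_ops = {
    "squeeze2": 449.727,
    "unsqueeze2": 444.685,
    "pad3d": 143.108,
}

# Time attributed to the unsqueeze2 + pad3d + squeeze2 pattern as a whole.
pattern_time = sum(pattern_ops.values())
share = pattern_time / total_time

print(f"pattern time: {pattern_time:.2f} ({share:.1%} of total)")
```

The three ops together account for a little under 15% of the profiled time, which is in line with the additional 13% improvement estimated for replacing them with pad2d.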

jakpiase, Jun 30 '22 17:06

Thank you for your suggestion

wuyefeilin, Jul 05 '22 03:07

Really looking forward to that performance boost

wrobcio789, Aug 19 '22 17:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

github-actions[bot], Dec 09 '22 17:12