PaddleSeg PP-HumanMatting has non-optimal graph

While optimizing the PP-HumanMatting model, I found that its computational graph contains an odd, non-optimal pattern. Instead of a single pad2d op, it uses a combination of unsqueeze2 + pad3d + squeeze2 ops, which behaves like pad2d but slows the model down significantly. I have written oneDNN kernels for both pad3d and pad2d in PR #43990. That PR sped up the HumanMatting model by 30%, but achieving even better performance requires replacing the unsqueeze2 + pad3d + squeeze2 pattern with pad2d. Doing so would improve the model's performance under oneDNN by another 13% compared to the current profiling listed below.
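To illustrate why the three-op pattern can be collapsed into one op, here is a minimal pure-Python sketch of the semantics. It is not Paddle code: the helper names (`pad2d`, `pad3d`, `unsqueeze_depth`, `squeeze_depth`) are hypothetical models written for this example, and it assumes an NCHW input, an unsqueeze that inserts a depth axis of size 1 at position 2, and a pad3d that applies no padding along that depth axis.

```python
# Pure-Python model of the op semantics on nested lists; NOT the Paddle API.
# Assumptions: NCHW layout, depth axis inserted at position 2, constant padding.

def pad_plane(plane, top, bottom, left, right, value=0.0):
    """Constant-pad a single H x W plane given as a list of rows."""
    width = len(plane[0]) + left + right
    padded = [[value] * width for _ in range(top)]
    padded += [[value] * left + list(row) + [value] * right for row in plane]
    padded += [[value] * width for _ in range(bottom)]
    return padded

def pad2d(x, top, bottom, left, right, value=0.0):
    """Model of a 2-D pad over H and W of an NCHW tensor."""
    return [[pad_plane(p, top, bottom, left, right, value) for p in sample]
            for sample in x]

def unsqueeze_depth(x):
    """NCHW -> NCDHW with D == 1."""
    return [[[plane] for plane in sample] for sample in x]

def pad3d(x, front, back, top, bottom, left, right, value=0.0):
    """Model of a 3-D pad over D, H and W of an NCDHW tensor."""
    out = []
    for sample in x:
        channels = []
        for volume in sample:
            planes = [pad_plane(p, top, bottom, left, right, value) for p in volume]
            h, w = len(planes[0]), len(planes[0][0])
            blank = [[value] * w for _ in range(h)]
            channels.append([[row[:] for row in blank] for _ in range(front)]
                            + planes
                            + [[row[:] for row in blank] for _ in range(back)])
        out.append(channels)
    return out

def squeeze_depth(x):
    """NCDHW with D == 1 -> NCHW."""
    assert all(len(volume) == 1 for sample in x for volume in sample)
    return [[volume[0] for volume in sample] for sample in x]

# One sample, one channel, 2 x 2 spatial size.
x = [[[[1.0, 2.0],
       [3.0, 4.0]]]]

# The pattern from the graph: unsqueeze -> pad3d (no depth padding) -> squeeze.
roundabout = squeeze_depth(pad3d(unsqueeze_depth(x), 0, 0, 1, 1, 1, 1))
# The single op it is equivalent to.
direct = pad2d(x, 1, 1, 1, 1)

assert roundabout == direct
```

Under these assumptions the two paths produce identical results, so the extra unsqueeze2 and squeeze2 steps only add overhead.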

Pattern spotted in humanmatting_model.zip:

Profiling measured on Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz after #43990:

------------------------- Overhead Summary -------------------------

Total time: 6985.58
Computation time	Total: 6877.93	Ratio: 98.4589%
Framework overhead	Total: 107.654	Ratio: 1.5411%

------------------------- Event Summary -------------------------

Event	Calls	Total	Min.	Max.	Ave.	Ratio.
thread0::conv2d	900	3625.98	0.061789	52.8846	4.02886	0.519066
thread0::bilinear_interp_v2	200	746.636	0.152614	36.586	3.73318	0.106882
thread0::Executor::Run	1	665.823	665.823	665.823	665.823	0.0953138
Executor::RunPartialPreparedContext	1	665.703	665.703	665.703	665.703	0.0952967
load_combine	1	665.569	665.569	665.569	665.569	0.0952776
thread0::concat	170	551.391	0.019189	39.612	3.24347	0.0789327
thread0::squeeze2	30	449.727	1.65939	109.896	14.9909	0.0643793
thread0::unsqueeze2	30	444.685	1.78944	110.22	14.8228	0.0636576
thread0::arg_max	10	153.128	13.949	16.8544	15.3128	0.0219206
thread0::pad3d	30	143.108	0.613291	16.0822	4.77026	0.0204862
thread0::nearest_interp_v2	10	91.2025	6.93993	22.8784	9.12025	0.0130558
thread0::slice	80	35.9588	0.009735	3.30152	0.449485	0.00514757
thread0::relu	10	20.5054	1.76941	2.98284	2.05054	0.00293538
thread0::pool2d	70	19.2899	0.070717	0.963652	0.27557	0.00276139
thread0::softmax	10	15.4916	1.3888	1.69225	1.54916	0.00221766
thread0::equal	20	6.84826	0.267666	0.417393	0.342413	0.000980343
thread0::elementwise_add	40	5.39096	0.066211	0.4031	0.134774	0.000771727
thread0::cast	20	5.02093	0.183324	0.665627	0.251047	0.000718757
thread0::elementwise_mul	10	1.52323	0.11234	0.188003	0.152323	0.000218053
thread0::sigmoid	10	1.40188	0.096007	0.208529	0.140188	0.000200681
thread0::shape	40	0.898588	0.010386	0.084814	0.0224647	0.000128635
thread0::scale	20	0.694454	0.00987	0.102248	0.0347227	9.94125e-05
thread0::fill_constant	40	0.637646	0.00751	0.045786	0.0159411	9.12803e-05
thread0::elementwise_floordiv	20	0.241182	0.007927	0.023915	0.0120591	3.45257e-05
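As a rough cross-check, the share of the profile taken by the three ops in the pattern can be added up from the table above. This is only a back-of-the-envelope sketch using the reported totals; the variable names are illustrative, and it does not account for the cost of the pad2d op that would replace the pattern.

```python
# Totals copied from the Event Summary table (same time unit as the profiler output).
total_time = 6985.58
pattern_ops = {
    "squeeze2": 449.727,
    "unsqueeze2": 444.685,
    "pad3d": 143.108,
}

# Combined time of the unsqueeze2 + pad3d + squeeze2 pattern.
pattern_time = sum(pattern_ops.values())
# Fraction of the whole profiled run spent in the pattern.
pattern_share = pattern_time / total_time

print(f"pattern time: {pattern_time:.2f}")
print(f"share of total: {pattern_share:.1%}")
```

The three ops together take roughly 15% of the total time, which is in the same range as the 13% improvement estimated in the issue once the remaining pad2d cost is taken into account.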

Jun 30 '22 17:06 jakpiase

Thank you for your suggestion

Jul 05 '22 03:07 wuyefeilin

Really looking forward to that performance boost

Aug 19 '22 17:08 wrobcio789

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

Dec 09 '22 17:12 github-actions[bot]
