Halide icon indicating copy to clipboard operation
Halide copied to clipboard

Performance regression on depthwise_separable_conv

Open rootjalex opened this issue 4 years ago • 10 comments

I have a branch that is ~30 commits behind master (specifically up to date with 813eadc) that reports a manual schedule of depthwise_separable_conv of:

Manually-tuned time: 0.369731ms

Meanwhile, my current master branch (specifically up to date with 3e034d6) reports a manual schedule of:

Manually-tuned time: 0.590547ms

(Not including autoscheduling times because the branch has autoscheduler changes).

Per @abadams opening this issue to see if others can reproduce / diagnose

rootjalex avatar Feb 09 '21 17:02 rootjalex

Can't repro on linux or os x with llvm 11. What llvm version are you using?

abadams avatar Feb 10 '21 19:02 abadams

LLVM 11 on linux and LLVM 10 on mac os, both experience the slowdown

rootjalex avatar Feb 10 '21 20:02 rootjalex

I still can't repro. Can you check the two commits you mention in clean checkouts? I.e. not as part of a branch?

abadams avatar Feb 10 '21 20:02 abadams

Clean checkouts on linux machine: 3e034d6:

Manually-tuned time: 0.52348ms

813eadc:

Manually-tuned time: 0.360595ms

Can confirm clean checkouts on the mac later if necessary

rootjalex avatar Feb 10 '21 22:02 rootjalex

Hmm, it's not nearly as bad on the mac. 3e034d6:

Manually-tuned time: 0.497357ms

813eadc:

Manually-tuned time: 0.426782ms

rootjalex avatar Feb 10 '21 23:02 rootjalex

It is really important to provide hardware information. At least the number of cores. If there is any indication this might have to do with parallel scheduling, I would comment out the spin change, easiest way to do that is to set the spincount to instead of 40 at the top.

zvookin avatar Feb 10 '21 23:02 zvookin

Ah sorry about that. Mac has 6, Linux has 24.

rootjalex avatar Feb 10 '21 23:02 rootjalex

The main difference I'm finding in the generated assembly is the use of halide_mutex_(unlock + yield + lock) versus halide_cond_wait. Per @abadams , I set max_spin_count=0 in src/runtime/thread_pool_common.h and am now seeing:

Manually-tuned time: 0.140013ms

The earlier commit also achieves approximately this performance if I set HL_NUM_THREADS=12.

rootjalex avatar Feb 11 '21 00:02 rootjalex

Yeah, that's enough of a difference that we should probably just revert that part of the earlier change. Will make a PR in the morning.

abadams avatar Feb 11 '21 01:02 abadams

Did this ever get dealt with? Shall we close this?

steven-johnson avatar Jul 12 '22 18:07 steven-johnson