Halide
Performance regression on depthwise_separable_conv
I have a branch that is ~30 commits behind master (specifically, up to date with 813eadc). Its manual schedule for depthwise_separable_conv reports:
Manually-tuned time: 0.369731ms
Meanwhile, on my current master branch (specifically, up to date with 3e034d6), the same manual schedule reports:
Manually-tuned time: 0.590547ms
(Not including autoscheduling times because the branch has autoscheduler changes).
Per @abadams, I'm opening this issue to see if others can reproduce or diagnose it.
Can't repro on linux or os x with llvm 11. What llvm version are you using?
LLVM 11 on linux and LLVM 10 on mac os, both experience the slowdown
I still can't repro. Can you check the two commits you mention in clean checkouts? I.e. not as part of a branch?
Clean checkouts on the linux machine:
3e034d6:
Manually-tuned time: 0.52348ms
813eadc:
Manually-tuned time: 0.360595ms
Can confirm clean checkouts on the mac later if necessary
Hmm, it's not nearly as bad on the mac.
3e034d6:
Manually-tuned time: 0.497357ms
813eadc:
Manually-tuned time: 0.426782ms
It is really important to provide hardware information, at least the number of cores. If there is any indication this might have to do with parallel scheduling, I would comment out the spin change. The easiest way to do that is to set the spin count to 0 instead of 40 at the top.
Ah sorry about that. Mac has 6, Linux has 24.
The main difference I'm finding in the generated assembly is the use of halide_mutex_(unlock + yield + lock) versus halide_cond_wait. Per @abadams, I set max_spin_count=0 in src/runtime/thread_pool_common.h and am now seeing:
Manually-tuned time: 0.140013ms
The earlier commit also achieves approximately this performance if I set HL_NUM_THREADS=12.
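For readers following along, the two idle strategies compared above (unlock + yield + lock versus a condition wait) can be sketched roughly as below. This is a minimal illustration using the C++ standard library, not Halide's runtime code: `SpinThenSleepQueue`, `worker`, and `run_jobs` are hypothetical names, and only `max_spin_count` is borrowed from the thread. The real thread pool is more involved and may differ in every detail.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical single-worker job queue (illustration only, not Halide's runtime).
struct SpinThenSleepQueue {
    std::mutex mutex;
    std::condition_variable wake;
    std::deque<int> jobs;
    bool done = false;
};

// Worker loop. When the queue is empty, poll it up to max_spin_count times
// (unlock, yield, relock) before blocking on the condition variable.
// With max_spin_count == 0 the worker goes straight to the condition wait.
int worker(SpinThenSleepQueue &q, int max_spin_count) {
    assert(max_spin_count >= 0);
    int processed = 0;
    int spins = 0;
    std::unique_lock<std::mutex> lock(q.mutex);
    while (true) {
        if (!q.jobs.empty()) {
            // A real pool would release the lock while running the job.
            q.jobs.pop_front();
            processed++;
            spins = 0;
        } else if (q.done) {
            return processed;
        } else if (spins < max_spin_count) {
            // Spin path: give other threads a chance, then re-check the queue.
            spins++;
            lock.unlock();
            std::this_thread::yield();
            lock.lock();
        } else {
            // Sleep path: block until work arrives or shutdown is requested.
            q.wake.wait(lock, [&] { return !q.jobs.empty() || q.done; });
            spins = 0;
        }
    }
}

// Push num_jobs jobs to one worker and return how many it processed.
int run_jobs(int num_jobs, int max_spin_count) {
    SpinThenSleepQueue q;
    int processed = 0;
    std::thread t([&] { processed = worker(q, max_spin_count); });
    for (int i = 0; i < num_jobs; i++) {
        {
            std::lock_guard<std::mutex> lock(q.mutex);
            q.jobs.push_back(i);
        }
        q.wake.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(q.mutex);
        q.done = true;
    }
    q.wake.notify_one();
    t.join();
    return processed;
}
```

Calling `run_jobs(1000, 0)` and `run_jobs(1000, 40)` processes the same jobs either way; only the idle behaviour differs. The spinning variant keeps idle workers competing for the mutex and for CPU time, which is one plausible reason a higher spin count could hurt on a machine with many cores.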
Yeah, that's enough of a difference that we should probably just revert that part of the earlier change. Will make a PR in the morning.
Did this ever get dealt with? Shall we close this?