OpenBLAS
sched_yield spam
OpenBLAS spams the sched_yield syscall (40k interrupts/s), resulting in severe performance degradation.
Highest impact:
/usr/lib/x86_64-linux-gnu/libc-2.31.so(__sched_yield+0x7) [0xe14e7]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(sgemm_thread_nt+0xe35) [0x1f4c75]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(exec_blas+0xce) [0x348e8e]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(sgemm_thread_nt+0x4f3) [0x1f4333]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(sgemm_thread_tn+0x117) [0x1f5097]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(cblas_sgemm+0x449) [0x1215d9]
[...] unexpected_backtracing_error [0x7fdcfc459831]
Less but still noticeable impact:
/usr/lib/x86_64-linux-gnu/libc-2.31.so(__sched_yield+0x7) [0xe14e7]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(sgemm_thread_nt+0xe35) [0x1f4c75]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(openblas_read_env+0x36d) [0x34885d]
/usr/lib/x86_64-linux-gnu/libpthread-2.31.so(start_thread+0xd7) [0x7ea7]
/usr/lib/x86_64-linux-gnu/libc-2.31.so(clone+0x3f) [0xfba2f]
pick_next_task_fair, pick_next_task, update_curr, etc. are all called by the sched_yield syscall.
The percentages are global; that syscall needs >30% of my CPU.
What is your hardware and operating system, please? We've been there before, but actual performance degradation has mostly been seen on some laptop models, and results for other solutions were a bit inconclusive. If you compile OpenBLAS yourself, find the single
#ifndef YIELDING
#define YIELDING sched_yield()
#endif
in common.h and copy one of the __asm__ __volatile__ ("nop;nop;nop\n"); lines from just above it into that definition, to see if it makes a significant difference.
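For concreteness, the experiment would look roughly like the sketch below. The exact nop lines above the default in common.h vary by platform, so treat this only as the shape of the change, not the actual patch:

#ifndef YIELDING
/* default: give the CPU back to the kernel scheduler */
/* #define YIELDING sched_yield() */
/* experiment: burn a few no-ops in user space instead of making a syscall */
#define YIELDING __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;\n");
#endif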
Among the options were nanosleep(1) etc., but those also involve syscalls, which keep getting slower as more and more Spectre mitigation code is added to the syscall path. YIELDING is what a thread does when it has completed its computation and should, more or less, park itself until further work gets passed to it.
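A nanosleep-based YIELDING would look roughly like the hypothetical macro below; it is shown only to illustrate why it does not escape the problem: every invocation still crosses into the kernel once.

#include <time.h>

/* hypothetical alternative: request a 1 ns sleep instead of yielding;
   still one syscall per invocation, so the mitigation overhead remains */
#define YIELDING do { \
    struct timespec req = { 0, 1 }; \
    nanosleep(&req, NULL); \
} while (0)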
http://vger.kernel.org/~acme/unbehaved.txt -> sched_yield may place the task at the start of the run queue and turn the wait into an effective busy loop...
That's one of the reasons why I asked for OS and hardware. So far, the only reason to hold on to sched_yield as the default on many platforms is that it has performed at least marginally better than any proposed alternative whenever the topic came up.
AMD Ryzen 5 3600 Hexa-Core running Debian "bullseye" 11.8, if it helps :)
Not too different from my 4600H, it would seem - I'll look up whether Bullseye used a different scheduler.
These seem to be related: https://github.com/OpenMathLib/OpenBLAS/issues/3660 and https://github.com/OpenMathLib/OpenBLAS/issues/4063. From the link posted by @brada4:
One example can be found in recent changes to the iperf networking benchmark tool, that when the Linux kernel switched to the CFS scheduler exhibited a 30% drop in performance. Source code inspection showed that it was using sched_yield in a loop to check for a condition to be met. After a fix was made the performance drop disappeared and CPU consumption went down from 100% to saturate a gigabit network to just 9%[1].
https://github.com/OpenMathLib/OpenBLAS/blob/b1ae777afb071f3a80e6646ceaa587c4d2e10d23/driver/others/blas_server.c#L851 It could be rewritten to pthread_cond_* there and in some 50 other places - even if that is stable now, it invites tears later.
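For illustration only (invented names, not the actual blas_server.c code), the generic shape of that change is replacing a completion flag polled with sched_yield by the same wait expressed with a mutex and condition variable:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* illustrative completion flag and its synchronisation objects */
static atomic_int done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* pattern that spams sched_yield: spin until the flag flips */
static void wait_polling(void) {
    while (!atomic_load(&done))
        sched_yield();
}

/* pattern without polling: sleep on the condition variable until signalled */
static void wait_blocking(void) {
    pthread_mutex_lock(&lock);
    while (!atomic_load(&done))
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

/* the completing thread flips the flag and wakes any waiters */
static void signal_done(void) {
    pthread_mutex_lock(&lock);
    atomic_store(&done, 1);
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}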
One is MS Windows and the other is likely to be related to NUMA effects (dual socket).
Under some kernel configurations sched_yield turns into a busy loop on one core. The iperf change is 2 lines, but here it would take much more, like rewriting all the thread work schedulers to completely prepare the split work units and then wait() on condition signals as the other threads complete.
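Sketched with invented names, that kind of rewrite amounts to: split the work up front, have each worker decrement a pending counter when its unit is done, and let the dispatching thread sleep on a condition variable instead of yielding in a loop. This is only an assumption about how it could look, not existing OpenBLAS code:

#include <pthread.h>

/* hypothetical join object for a batch of pre-split work units */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_done;
    int pending;              /* work units not yet finished */
} job_join_t;

static void job_init(job_join_t *j, int units) {
    pthread_mutex_init(&j->lock, NULL);
    pthread_cond_init(&j->all_done, NULL);
    j->pending = units;
}

/* called by each worker when its unit is complete */
static void job_complete(job_join_t *j) {
    pthread_mutex_lock(&j->lock);
    if (--j->pending == 0)
        pthread_cond_signal(&j->all_done);
    pthread_mutex_unlock(&j->lock);
}

/* called by the dispatching thread instead of a sched_yield loop */
static void job_wait(job_join_t *j) {
    pthread_mutex_lock(&j->lock);
    while (j->pending > 0)
        pthread_cond_wait(&j->all_done, &j->lock);
    pthread_mutex_unlock(&j->lock);
}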
You're running away with that idea while we haven't even ascertained which scheduler is in use, or whether simply switching from sched_yield to a nop is enough to alleviate the problem.
Total time does not decrease, just less system time is accounted.
Thinking out loud...
exec_blas_async_wait() is a join point, so waiting efficiently is the goal. The Windows code previously waited on an Event per queued task, but that was a poor pattern that created a lot of short-lived kernel objects. If threads are tasked as evenly as possible, there should not be a lot of excess waiting due to asymmetry.
pthread_barrier is an atomic counter that makes 1 syscall to init and 1 syscall per thread to finish, but you need to know the counter value in advance, which means reorganising the code carefully.
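A minimal sketch of that barrier pattern, with the thread count fixed up front and all names invented for illustration:

#include <pthread.h>

#define NTHREADS 4   /* must be known before the barrier is initialised */

static pthread_barrier_t done_barrier;

static void *worker(void *arg) {
    (void)arg;
    /* ... compute this thread's share of the work ... */
    /* each thread arrives exactly once; the last arrival releases everyone */
    pthread_barrier_wait(&done_barrier);
    return NULL;
}

static void run_split_work(void) {
    pthread_t tids[NTHREADS - 1];
    pthread_barrier_init(&done_barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS - 1; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    worker(NULL);                        /* main thread does a share too */
    for (int i = 0; i < NTHREADS - 1; i++)
        pthread_join(tids[i], NULL);
    pthread_barrier_destroy(&done_barrier);
}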
I will make an attempt to replace all the yields in level3_thread.c with proper locks when I find the time; I'll post here if that fixes the problem.
A barrier, to gather completion of all sub-tasks without polling.