OpenBLAS
sched_yield spam
OpenBLAS spams the sched_yield syscall (40k interrupts/s), resulting in severe performance degradation.
Highest impact:
/usr/lib/x86_64-linux-gnu/libc-2.31.so(__sched_yield+0x7) [0xe14e7]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(sgemm_thread_nt+0xe35) [0x1f4c75]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(exec_blas+0xce) [0x348e8e]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(sgemm_thread_nt+0x4f3) [0x1f4333]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(sgemm_thread_tn+0x117) [0x1f5097]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(cblas_sgemm+0x449) [0x1215d9]
[...] unexpected_backtracing_error [0x7fdcfc459831]
Less but still noticeable impact:
/usr/lib/x86_64-linux-gnu/libc-2.31.so(__sched_yield+0x7) [0xe14e7]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(sgemm_thread_nt+0xe35) [0x1f4c75]
[...]/openblas/linux-x86_64/libopenblas_nolapack.so.0(openblas_read_env+0x36d) [0x34885d]
/usr/lib/x86_64-linux-gnu/libpthread-2.31.so(start_thread+0xd7) [0x7ea7]
/usr/lib/x86_64-linux-gnu/libc-2.31.so(clone+0x3f) [0xfba2f]
pick_next_task_fair, pick_next_task, update_curr, etc. are all called by the sched_yield syscall.
The percentages are global; that syscall needs >30% of my CPU.
What is your hardware and operating system, please? We've been there before, but actual performance degradation has mostly been seen on some laptop models, and results for other solutions were a bit inconclusive. If you compile OpenBLAS yourself, find the single
#ifndef YIELDING
#define YIELDING sched_yield()
#endif
in common.h and copy one of the __asm__ __volatile__ ("nop;nop;nop\n"); lines from just above it into that definition, to see if it makes a significant difference.
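For concreteness, the experiment would look roughly like the sketch below. The exact nop lines above the default in common.h vary by platform, so treat this only as the shape of the change, not the actual patch:

#ifndef YIELDING
/* default: give the CPU back to the kernel scheduler */
/* #define YIELDING sched_yield() */
/* experiment: burn a few no-ops in user space instead of making a syscall */
#define YIELDING __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;\n");
#endif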
Among the options were nanosleep(1) etc., but those also involve syscalls, which keep getting slower as more and more Spectre mitigation code is added to the syscall path. YIELDING is what a thread does when it has completed its computation and should, more or less, park itself until further work gets passed to it.
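A nanosleep-based YIELDING would look roughly like the hypothetical macro below; it is shown only to illustrate why it does not escape the problem: every invocation still crosses into the kernel once.

#include <time.h>

/* hypothetical alternative: request a 1 ns sleep instead of yielding;
   still one syscall per invocation, so the mitigation overhead remains */
#define YIELDING do { \
    struct timespec req = { 0, 1 }; \
    nanosleep(&req, NULL); \
} while (0)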
http://vger.kernel.org/~acme/unbehaved.txt -> sched_yield may place the task at the start of the run queue and turn the wait into an effective busy loop...
That's one of the reasons why I asked for OS and hardware. So far, the only reason to hold on to sched_yield as the default on many platforms is that it has performed at least marginally better than any proposed alternative whenever the topic came up.
AMD Ryzen 5 3600 Hexa-Core running Debian "bullseye" 11.8, if it helps :)
Not too different from my 4600H, it would seem - I'll look up whether Bullseye used a different scheduler.
These seem to be related: https://github.com/OpenMathLib/OpenBLAS/issues/3660 and https://github.com/OpenMathLib/OpenBLAS/issues/4063. From the link posted by @brada4:
One example can be found in recent changes to the iperf networking benchmark tool, that when the Linux kernel switched to the CFS scheduler exhibited a 30% drop in performance. Source code inspection showed that it was using sched_yield in a loop to check for a condition to be met. After a fix was made the performance drop disappeared and CPU consumption went down from 100% to saturate a gigabit network to just 9%[1].
https://github.com/OpenMathLib/OpenBLAS/blob/b1ae777afb071f3a80e6646ceaa587c4d2e10d23/driver/others/blas_server.c#L851 It could be rewritten to pthread_cond_* there and in some 50 other places - even if that is stable now, it invites tears later.
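For illustration only (invented names, not the actual blas_server.c code), the generic shape of that change is replacing a completion flag polled with sched_yield by the same wait expressed with a mutex and condition variable:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* illustrative completion flag and its synchronisation objects */
static atomic_int done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* pattern that spams sched_yield: spin until the flag flips */
static void wait_polling(void) {
    while (!atomic_load(&done))
        sched_yield();
}

/* pattern without polling: sleep on the condition variable until signalled */
static void wait_blocking(void) {
    pthread_mutex_lock(&lock);
    while (!atomic_load(&done))
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

/* the completing thread flips the flag and wakes any waiters */
static void signal_done(void) {
    pthread_mutex_lock(&lock);
    atomic_store(&done, 1);
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}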
One is MS Windows and the other is likely to be related to NUMA effects (dual socket).
Under some kernel configurations sched_yield turns into a busy loop on one core. The iperf change is 2 lines, but here it would take much more, like rewriting all the thread work schedulers to completely prepare the split work units and then wait() on condition signals as the other threads complete.
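Sketched with invented names, that kind of rewrite amounts to: split the work up front, have each worker decrement a pending counter when its unit is done, and let the dispatching thread sleep on a condition variable instead of yielding in a loop. This is only an assumption about how it could look, not existing OpenBLAS code:

#include <pthread.h>

/* hypothetical join object for a batch of pre-split work units */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_done;
    int pending;              /* work units not yet finished */
} job_join_t;

static void job_init(job_join_t *j, int units) {
    pthread_mutex_init(&j->lock, NULL);
    pthread_cond_init(&j->all_done, NULL);
    j->pending = units;
}

/* called by each worker when its unit is complete */
static void job_complete(job_join_t *j) {
    pthread_mutex_lock(&j->lock);
    if (--j->pending == 0)
        pthread_cond_signal(&j->all_done);
    pthread_mutex_unlock(&j->lock);
}

/* called by the dispatching thread instead of a sched_yield loop */
static void job_wait(job_join_t *j) {
    pthread_mutex_lock(&j->lock);
    while (j->pending > 0)
        pthread_cond_wait(&j->all_done, &j->lock);
    pthread_mutex_unlock(&j->lock);
}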
You're running away with that idea while we haven't even ascertained which scheduler is in use, or whether simply switching from sched_yield to a nop is enough to alleviate the problem.
Total time does not decrease, just less system time is accounted.
Thinking out loud...
exec_blas_async_wait() is a join point, so waiting efficiently is the goal. The Windows code previously waited on an Event per queued task, but that was a poor pattern that created a lot of short-lived kernel objects. If threads are tasked as evenly as possible, there should not be a lot of excess waiting due to asymmetry.
pthread_barrier is an atomic counter that makes 1 syscall to init and 1 syscall per thread to finish, but you need to know the counter value in advance, which means reorganising the code carefully.
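A minimal sketch of that barrier pattern, with the thread count fixed up front and all names invented for illustration:

#include <pthread.h>

#define NTHREADS 4   /* must be known before the barrier is initialised */

static pthread_barrier_t done_barrier;

static void *worker(void *arg) {
    (void)arg;
    /* ... compute this thread's share of the work ... */
    /* each thread arrives exactly once; the last arrival releases everyone */
    pthread_barrier_wait(&done_barrier);
    return NULL;
}

static void run_split_work(void) {
    pthread_t tids[NTHREADS - 1];
    pthread_barrier_init(&done_barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS - 1; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    worker(NULL);                        /* main thread does a share too */
    for (int i = 0; i < NTHREADS - 1; i++)
        pthread_join(tids[i], NULL);
    pthread_barrier_destroy(&done_barrier);
}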
I will make an attempt to replace all the yields in level3_thread.c with proper locks when I find the time; I'll post here if that fixes the problem.
A barrier, to gather completion of all sub-tasks without polling.