Andrew
It will fall back to older compute kernels if CPUID does not report AVX2. The difference is in the 1-2 least significant bits of the significand and is expected. If you want...
Among the options were nanosleep(1) etc., but those also involve syscalls, which grow slower as more and more Spectre-related code is added to syscalls. YIELDING is what happens with a thread when it...
http://vger.kernel.org/~acme/unbehaved.txt -> sched_yield may place the task at the start of the runqueue, making it effectively a busy loop....
https://github.com/OpenMathLib/OpenBLAS/blob/b1ae777afb071f3a80e6646ceaa587c4d2e10d23/driver/others/blas_server.c#L851 It may be rewritten to pthread_cond_\* there and in 50 other places, *if* it is stable now, tears later.
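To make the pthread_cond_\* idea concrete, here is a minimal sketch of a worker that sleeps on a condition variable until work is posted, instead of polling with sched_yield. It is not OpenBLAS code: `mailbox_t`, `post_work`, `wait_for_work` and `run_condvar_demo` are hypothetical names invented for this illustration, and the single-slot mailbox is far simpler than a real work queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical single-slot mailbox; not the OpenBLAS queue layout. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    bool            has_work;
    int             payload;
} mailbox_t;

/* Producer side: publish one item and wake a sleeping worker. */
void post_work(mailbox_t *m, int value) {
    pthread_mutex_lock(&m->lock);
    m->payload  = value;
    m->has_work = true;
    pthread_cond_signal(&m->ready);
    pthread_mutex_unlock(&m->lock);
}

/* Consumer side: block in the kernel until work arrives, no yield loop. */
int wait_for_work(mailbox_t *m) {
    pthread_mutex_lock(&m->lock);
    while (!m->has_work)                        /* guards against spurious wakeups */
        pthread_cond_wait(&m->ready, &m->lock); /* releases the lock while asleep */
    m->has_work = false;
    int value = m->payload;
    pthread_mutex_unlock(&m->lock);
    return value;
}

typedef struct {
    mailbox_t box;
    int       received;
} demo_t;

static void *demo_worker(void *arg) {
    demo_t *d = arg;
    d->received = wait_for_work(&d->box);
    return NULL;
}

/* Starts one worker, posts one item, returns what the worker received. */
int run_condvar_demo(void) {
    demo_t d;
    pthread_t thread;

    pthread_mutex_init(&d.box.lock, NULL);
    pthread_cond_init(&d.box.ready, NULL);
    d.box.has_work = false;
    d.box.payload  = 0;
    d.received     = -1;

    int rc = pthread_create(&thread, NULL, demo_worker, &d);
    assert(rc == 0);
    post_work(&d.box, 42);
    pthread_join(thread, NULL);   /* join makes d.received visible here */

    pthread_cond_destroy(&d.box.ready);
    pthread_mutex_destroy(&d.box.lock);
    return d.received;
}
```

Build with `-pthread` and call `run_condvar_demo()` from `main`. The predicate loop around `pthread_cond_wait` is what makes it safe for the producer to post before the worker has started waiting.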
Under some kernel configurations sched_yield turns into a busy loop on one core. The iperf change is 2 lines, but here it will take much more, like rewriting all the thread work schedulers...
Total time does not decrease; there is just less system time accounted.
pthread_barrier is an atomic counter that makes 1 syscall to init and 1 syscall per thread to finish, but you need to know the counter value in advance, which requires reorganising the code...
Use a barrier to wait until all sub-tasks are complete, without polling.
Cast the matrices to vectors. Scal one and axpy the other?
So make a supercompat header to do the casts. BLAS L1 is quite well optimised by modern compilers.
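A small sketch of the scal-then-axpy idea, computing B := alpha*A + beta*B. It assumes both matrices are stored contiguously with no padding between columns or rows, so that each can be viewed as one vector of rows*cols elements. `ref_dscal`, `ref_daxpy` and `matrix_add_scaled` are hypothetical plain-C stand-ins written for this example; in real code the two loops would be calls to `cblas_dscal` and `cblas_daxpy` with unit increments.

```c
#include <assert.h>
#include <stddef.h>

/* x := alpha * x, unit stride (the operation dscal performs with incx = 1). */
void ref_dscal(size_t n, double alpha, double *x) {
    for (size_t i = 0; i < n; i++)
        x[i] *= alpha;
}

/* y := alpha * x + y, unit stride (the operation daxpy performs). */
void ref_daxpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/*
 * B := alpha * A + beta * B for two rows x cols matrices.
 * Assumption: both are contiguous (leading dimension equals the stored
 * extent), so the element order does not matter and each matrix can be
 * treated as a flat vector of rows * cols doubles.
 */
void matrix_add_scaled(size_t rows, size_t cols,
                       double alpha, const double *a,
                       double beta, double *b) {
    assert(a != b);               /* aliasing would scale A before it is added */
    size_t n = rows * cols;
    ref_dscal(n, beta, b);        /* B := beta * B */
    ref_daxpy(n, alpha, a, b);    /* B := alpha * A + B */
}
```

For a matrix declared as `double m[R][C]`, the "cast" is just passing `&m[0][0]`. If a matrix has padding (a leading dimension larger than its stored extent), the flat view no longer works and the two calls have to be made per column or per row instead.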