Andrew

900 comments by Andrew

It will fall back to older compute kernels if your CPU does not report AVX2 in CPUID. The difference is in the 1-2 least significant bits of the significand and is expected. If you want...
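A minimal sketch of CPUID-based dispatch with a generic fallback, assuming GCC or Clang on x86 (`__builtin_cpu_supports`, `target` attribute). It is not OpenBLAS's own dispatch code; `dot_generic`, `dot_avx2` and `select_dot` are hypothetical names. The AVX2 kernel sums in four lanes and uses FMA, so it adds in a different order than the scalar loop, which is exactly why the last bit or two can differ.

```c
#include <stddef.h>

/* Generic kernel: plain C loop, always available. */
static double dot_generic(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
/* AVX2+FMA kernel: compiled for AVX2 via the target attribute, only called
   when CPUID reports support. Four partial sums -> different rounding order. */
__attribute__((target("avx2,fma")))
static double dot_avx2(const double *x, const double *y, size_t n) {
    __m256d acc = _mm256_setzero_pd();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(x + i), _mm256_loadu_pd(y + i), acc);
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    double s = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; i++)          /* scalar tail */
        s += x[i] * y[i];
    return s;
}
#endif

typedef double (*dot_fn)(const double *, const double *, size_t);

/* Pick the AVX2 kernel if CPUID says so, otherwise the older generic one. */
static dot_fn select_dot(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return dot_avx2;
#endif
    return dot_generic;
}
```

With small integer data both kernels agree exactly; with general floating-point data, expect the two results to differ only in the last bit or two.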

Among the options were nanosleep(1) etc., but those also involve syscalls, which grow slower as more and more Spectre mitigation code is added to the syscall path. Yielding is what happens to a thread when it...

http://vger.kernel.org/~acme/unbehaved.txt -> sched_yield may place the task at the start of the runqueue and effectively turn the wait into a busy loop...
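A sketch of the polling pattern under discussion, with either sched_yield or nanosleep as the wait step. The names (`wait_strategy`, `poll_until_set`, `demo_poll`) are hypothetical, and reading "nanosleep(1)" as a 1 ns request is an assumption. Either way every pass is a syscall, and if sched_yield returns immediately the loop just burns the core.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical names for the two ways of waiting. */
enum wait_strategy { WAIT_YIELD, WAIT_NANOSLEEP };

/* Poll *flag until another thread sets it; every iteration enters the kernel. */
static unsigned long poll_until_set(atomic_int *flag, enum wait_strategy how) {
    const struct timespec one_ns = { 0, 1 };  /* assumption: "nanosleep(1)" = 1 ns */
    unsigned long polls = 0;
    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        if (how == WAIT_YIELD)
            sched_yield();             /* may return at once: effectively a busy loop */
        else
            nanosleep(&one_ns, NULL);  /* on Linux usually sleeps longer (timer slack) */
        polls++;
    }
    return polls;
}

/* Helper thread: sets the flag after about 1 ms. */
static void *set_later(void *arg) {
    const struct timespec one_ms = { 0, 1000000 };
    nanosleep(&one_ms, NULL);
    atomic_store_explicit((atomic_int *)arg, 1, memory_order_release);
    return NULL;
}

/* Demo: poll until the helper sets the flag; returns the final flag value. */
static int demo_poll(enum wait_strategy how) {
    atomic_int flag = 0;
    pthread_t t;
    pthread_create(&t, NULL, set_later, &flag);
    poll_until_set(&flag, how);
    pthread_join(t, NULL);
    return atomic_load(&flag);
}
```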

https://github.com/OpenMathLib/OpenBLAS/blob/b1ae777afb071f3a80e6646ceaa587c4d2e10d23/driver/others/blas_server.c#L851 It could be rewritten to use pthread_cond_\* there and in 50 other places; *if* it is stable now, such a rewrite means tears later.
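A minimal sketch of what replacing a yield-polled flag with pthread_cond_\* could look like for a single job slot. `struct job_slot`, `slot_post`, `slot_take` and `run_demo` are hypothetical names, not OpenBLAS code. The worker sleeps in `pthread_cond_wait` instead of spinning, and the `while` loop guards against spurious wakeups.

```c
#include <pthread.h>

/* Hypothetical job slot replacing a yield-polled "job ready" flag. */
struct job_slot {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             has_job;
    int             payload;
};

static void slot_init(struct job_slot *s) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->ready, NULL);
    s->has_job = 0;
    s->payload = 0;
}

/* Producer: publish a job and wake one waiting worker. */
static void slot_post(struct job_slot *s, int payload) {
    pthread_mutex_lock(&s->lock);
    s->payload = payload;
    s->has_job = 1;
    pthread_cond_signal(&s->ready);
    pthread_mutex_unlock(&s->lock);
}

/* Worker: block (no polling) until a job is posted, then take it. */
static int slot_take(struct job_slot *s) {
    pthread_mutex_lock(&s->lock);
    while (!s->has_job)                 /* re-check after spurious wakeups */
        pthread_cond_wait(&s->ready, &s->lock);
    s->has_job = 0;
    int p = s->payload;
    pthread_mutex_unlock(&s->lock);
    return p;
}

struct demo_ctx { struct job_slot slot; int result; };

static void *worker(void *arg) {
    struct demo_ctx *c = arg;
    c->result = 2 * slot_take(&c->slot);   /* stand-in for real work */
    return NULL;
}

/* Demo: start a worker, post one job, return the worker's result. */
static int run_demo(int payload) {
    struct demo_ctx c;
    pthread_t t;
    slot_init(&c.slot);
    c.result = 0;
    pthread_create(&t, NULL, worker, &c);
    slot_post(&c.slot, payload);
    pthread_join(t, NULL);
    pthread_cond_destroy(&c.slot.ready);
    pthread_mutex_destroy(&c.slot.lock);
    return c.result;
}
```

It does not matter whether the worker or the producer runs first: `has_job` records the post, so the wakeup cannot be lost.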

Under some kernel configurations, sched_yield turns into a busy loop on one core. The iperf change is 2 lines, but here it would take much more, like rewriting all the thread work schedulers...

Total time does not decrease; less of it is just accounted as system time.
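One way to see the split is to compare wall time with the user and system CPU time from `getrusage`. This sketch assumes Linux/glibc; `busy_wait` and `struct cpu_split` are hypothetical names. It spins for a fixed wall time with or without sched_yield on each pass, so the wall time is the same by construction and only the accounting shifts.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* clock_gettime/getrusage under strict -std= modes */
#endif
#include <sched.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical result type: wall-clock time vs CPU time split into user/system. */
struct cpu_split { double wall, user, sys; };

static double timeval_to_s(struct timeval tv) {
    return (double)tv.tv_sec + tv.tv_usec / 1e6;
}

static double elapsed_s(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Spin for `seconds` of wall time, optionally calling sched_yield every pass. */
static struct cpu_split busy_wait(double seconds, int use_yield) {
    struct rusage r0, r1;
    struct timespec t0, t1;
    getrusage(RUSAGE_SELF, &r0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        if (use_yield)
            sched_yield();                    /* syscall: counted as system time */
        clock_gettime(CLOCK_MONOTONIC, &t1);  /* usually vDSO on Linux, no syscall */
    } while (elapsed_s(t0, t1) < seconds);
    getrusage(RUSAGE_SELF, &r1);
    struct cpu_split s;
    s.wall = elapsed_s(t0, t1);
    s.user = timeval_to_s(r1.ru_utime) - timeval_to_s(r0.ru_utime);
    s.sys  = timeval_to_s(r1.ru_stime) - timeval_to_s(r0.ru_stime);
    return s;
}
```

On a typical Linux machine one would expect the yielding run to report noticeably more system time and less user time, while the core is busy for the whole interval in both cases.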

pthread_barrier is an atomic counter that makes 1 syscall to init and 1 syscall per thread to finish, but you need to know the counter value in advance, which requires reorganising the code...
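A sketch of the fixed-count requirement, assuming a POSIX system that provides `pthread_barrier_t` (e.g. Linux/glibc). `NTHREADS`, `bar_worker` and `run_barrier` are hypothetical. The count passed to `pthread_barrier_init` must be known before any thread starts; here it is the workers plus the main thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

#define NTHREADS 4   /* hypothetical; must be known before pthread_barrier_init */

struct bar_ctx { pthread_barrier_t bar; int partial[NTHREADS]; };
struct bar_arg { struct bar_ctx *ctx; int id; };

static void *bar_worker(void *p) {
    struct bar_arg *a = p;
    a->ctx->partial[a->id] = a->id + 1;   /* stand-in for a sub-task */
    pthread_barrier_wait(&a->ctx->bar);   /* arrive; blocks until all have arrived */
    return NULL;
}

/* Run NTHREADS sub-tasks, wait for all of them at the barrier, sum results. */
static int run_barrier(void) {
    struct bar_ctx ctx;
    struct bar_arg args[NTHREADS];
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&ctx.bar, NULL, NTHREADS + 1);   /* workers + main */
    for (int i = 0; i < NTHREADS; i++) {
        args[i].ctx = &ctx;
        args[i].id = i;
        pthread_create(&tid[i], NULL, bar_worker, &args[i]);
    }
    pthread_barrier_wait(&ctx.bar);   /* main sleeps here instead of polling */
    int sum = 0;
    for (int i = 0; i < NTHREADS; i++)
        sum += ctx.partial[i];
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&ctx.bar);
    return sum;
}
```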

Use a barrier to wait until all sub-tasks complete, without polling.
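When the number of sub-tasks is not known up front, a counting latch built from a mutex and a condition variable is one way to gather completions without polling. This is a swapped-in technique, not pthread_barrier itself; `struct latch`, `latch_add`, `latch_done`, `latch_wait` and `demo_latch` are hypothetical names.

```c
#include <pthread.h>

/* Hypothetical latch: count can grow as tasks are added, waiters sleep until 0. */
struct latch { pthread_mutex_t m; pthread_cond_t zero; int pending; };

static void latch_init(struct latch *l) {
    pthread_mutex_init(&l->m, NULL);
    pthread_cond_init(&l->zero, NULL);
    l->pending = 0;
}

static void latch_add(struct latch *l, int n) {      /* register n more sub-tasks */
    pthread_mutex_lock(&l->m);
    l->pending += n;
    pthread_mutex_unlock(&l->m);
}

static void latch_done(struct latch *l) {            /* one sub-task finished */
    pthread_mutex_lock(&l->m);
    if (--l->pending == 0)
        pthread_cond_broadcast(&l->zero);
    pthread_mutex_unlock(&l->m);
}

static void latch_wait(struct latch *l) {            /* sleep until all are done */
    pthread_mutex_lock(&l->m);
    while (l->pending > 0)
        pthread_cond_wait(&l->zero, &l->m);
    pthread_mutex_unlock(&l->m);
}

#define NTASKS 8
struct task { struct latch *l; int in; int out; };

static void *run_task(void *p) {
    struct task *t = p;
    t->out = t->in * t->in;      /* stand-in for a sub-task */
    latch_done(t->l);
    return NULL;
}

/* Demo: add tasks one by one, wait on the latch, sum the results. */
static int demo_latch(void) {
    struct latch l;
    struct task tasks[NTASKS];
    pthread_t tid[NTASKS];
    latch_init(&l);
    for (int i = 0; i < NTASKS; i++) {
        latch_add(&l, 1);                 /* add before starting the task */
        tasks[i].l = &l;
        tasks[i].in = i;
        tasks[i].out = 0;
        pthread_create(&tid[i], NULL, run_task, &tasks[i]);
    }
    latch_wait(&l);
    int sum = 0;
    for (int i = 0; i < NTASKS; i++)
        sum += tasks[i].out;
    for (int i = 0; i < NTASKS; i++)
        pthread_join(tid[i], NULL);
    pthread_cond_destroy(&l.zero);
    pthread_mutex_destroy(&l.m);
    return sum;
}
```

Each task is registered before it starts, and `latch_wait` is only called after all adds, so an early finisher cannot release the waiter prematurely.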

Cast the matrices to vectors, then scal one and axpy the other?
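A sketch of the idea for B := alpha*A + beta*B, assuming both matrices are stored contiguously (leading dimension equal to the row count, no padding), so each can be treated as one vector of length m*n. `scal`, `axpy` and `mat_axpby` are hypothetical reference loops; with a real CBLAS one would call `cblas_dscal(len, beta, B, 1)` and `cblas_daxpy(len, alpha, A, 1, B, 1)` instead, keeping in mind that their length argument is an `int`.

```c
#include <stddef.h>

/* Reference Level-1 loops (hypothetical stand-ins for dscal/daxpy). */
static void scal(size_t n, double alpha, double *x) {
    for (size_t i = 0; i < n; i++)
        x[i] *= alpha;
}

static void axpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* B := alpha*A + beta*B for contiguous m-by-n matrices (ld == m). */
static void mat_axpby(size_t m, size_t n, double alpha, const double *A,
                      double beta, double *B) {
    size_t len = m * n;        /* "cast" each matrix to one long vector */
    scal(len, beta, B);        /* B := beta*B */
    axpy(len, alpha, A, B);    /* B := alpha*A + B */
}
```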

So make a supercompat header to do the casts. BLAS Level 1 is optimised quite well by modern compilers.
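The kind of helper such a header might contain, as a hypothetical sketch: `compat_axpby` and `compat_mat_axpby` are illustrative names, matrices are assumed column-major, and `A` and `B` must not overlap (`restrict`). It does the vector cast when both matrices are contiguous and falls back to one call per column when they are padded. The plain loop is the sort of Level 1 code GCC and Clang usually auto-vectorise at higher optimisation levels.

```c
#include <stddef.h>

/* y := alpha*x + beta*y as a plain loop the compiler can usually vectorise. */
static inline void compat_axpby(size_t n, double alpha, const double *restrict x,
                                double beta, double *restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + beta * y[i];
}

/* Column-major B := alpha*A + beta*B with leading dimensions lda, ldb. */
static inline void compat_mat_axpby(size_t m, size_t n, double alpha,
                                    const double *A, size_t lda,
                                    double beta, double *B, size_t ldb) {
    if (lda == m && ldb == m) {
        /* contiguous: treat both matrices as one vector of length m*n */
        compat_axpby(m * n, alpha, A, beta, B);
        return;
    }
    /* padded: one vector call per column, skipping the padding rows */
    for (size_t j = 0; j < n; j++)
        compat_axpby(m, alpha, A + j * lda, beta, B + j * ldb);
}
```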