Andrew comments

Results: 900 comments by Andrew


Incorrect result with `cblas_dgemv` vs reference netlib and other libraries

It will fall back to older compute kernels if your CPU does not report AVX2 in CPUID. The difference is in the lowest 1-2 bits of the significand and is expected. If you want...
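Since results from different kernels can legitimately differ in the last bits, a comparison against a reference library is usually done with a tolerance rather than with `==`. The following is a minimal sketch of such a check, measured in units in the last place (ULPs). The helper names `ulp_distance` and `nearly_equal_ulps` are hypothetical and not part of OpenBLAS or netlib, the code assumes IEEE 754 doubles and does not handle NaN, and the right threshold depends on the problem size, since rounding differences can accumulate over long sums.

```c
#include <stdint.h>
#include <string.h>

/* Map a double onto an integer scale that is monotonic in its value,
   so that two adjacent doubles differ by exactly 1.
   Assumes IEEE 754 binary64; NaN inputs are not handled. */
static int64_t ordered_bits(double x)
{
    int64_t i;
    memcpy(&i, &x, sizeof i);            /* reinterpret the bit pattern */
    return i < 0 ? INT64_MIN - i : i;    /* fold negatives below zero */
}

/* Number of representable doubles between a and b (hypothetical helper). */
static uint64_t ulp_distance(double a, double b)
{
    int64_t ia = ordered_bits(a);
    int64_t ib = ordered_bits(b);
    return ia > ib ? (uint64_t)ia - (uint64_t)ib
                   : (uint64_t)ib - (uint64_t)ia;
}

/* Nonzero if a and b are within max_ulps of each other. */
static int nearly_equal_ulps(double a, double b, uint64_t max_ulps)
{
    return ulp_distance(a, b) <= max_ulps;
}
```

A test that compares `cblas_dgemv` output element by element could call `nearly_equal_ulps(y_openblas[i], y_reference[i], 2)` for short vectors; the array names here are placeholders.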

sched_yield spam

Among the options were nanosleep(1) etc., but those also involve syscalls, which grow slower as more and more Spectre mitigation code is added to syscalls. YIELDING is what happens with thread when it...

sched_yield spam

http://vger.kernel.org/~acme/unbehaved.txt -> sched_yield may place the task at the start of the run queue, which effectively makes it a busy loop....

https://github.com/OpenMathLib/OpenBLAS/blob/b1ae777afb071f3a80e6646ceaa587c4d2e10d23/driver/others/blas_server.c#L851 It could be rewritten to use pthread_cond_\* there and in 50 other places, *if* it is stable now, years later.
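For illustration, a condition-variable wait that replaces a yield loop might look roughly like the sketch below. This is not the OpenBLAS code at the linked line: the `worker_slot` type and the `slot_*` functions are hypothetical names, and error checking of the pthread calls is omitted for brevity.

```c
#include <pthread.h>

/* Hypothetical per-worker mailbox; not an OpenBLAS structure. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    int             pending;   /* jobs posted but not yet taken */
} worker_slot;

static void slot_init(worker_slot *s)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->wake, NULL);
    s->pending = 0;
}

/* Producer side: publish one job and wake the sleeping worker. */
static void slot_post(worker_slot *s)
{
    pthread_mutex_lock(&s->lock);
    s->pending++;
    pthread_cond_signal(&s->wake);
    pthread_mutex_unlock(&s->lock);
}

/* Worker side: sleep until a job is available, then take it.
   The while loop guards against spurious wakeups. */
static void slot_wait(worker_slot *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->pending == 0)
        pthread_cond_wait(&s->wake, &s->lock);
    s->pending--;
    pthread_mutex_unlock(&s->lock);
}
```

The worker blocks in the kernel inside `pthread_cond_wait` instead of calling `sched_yield` repeatedly, at the cost of a wake-up latency that a spinning thread would not pay.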

sched_yield spam

Under some kernel configurations sched_yield turns into a busy loop on one core. The iperf change is 2 lines, but here it will take much more, like rewriting all thread work schedulers...

sched_yield spam

Total time does not decrease; there is just less system time accounted.

sched_yield spam

pthread_barrier is an atomic counter that makes 1 syscall to init and 1 syscall per thread to finish, but you need to know the counter value in advance, which needs reorganising the code...

sched_yield spam

A barrier to wait until all sub-tasks are complete, without polling.

[Feature Request] cblas_zomatadd should be supported.

Cast the matrices to vectors. Scal one and axpy the other?

[Feature Request] cblas_zomatadd should be supported.

So make a supercompat header to do the casts. BLAS L1 is quite well optimised by modern compilers.
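As a sketch of what such a compatibility helper could contain, the function below computes C = alpha*A + beta*B by treating each matrix as one long vector, first scaling A into C and then doing an axpy with B. The name `compat_zomatadd_nn` is hypothetical. It only covers the case with no transposition or conjugation and with contiguous storage (leading dimension equal to the column or row length, so that the matrix really is a single vector), and it uses plain loops in place of `cblas_zcopy`, `cblas_zscal` and `cblas_zaxpy` so that it compiles without a BLAS library.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical compat helper: C = alpha*A + beta*B.
   Assumes untransposed, contiguous storage, so each rows x cols matrix
   can be viewed as a vector of rows*cols elements.
   C may be the same buffer as A, but must not otherwise overlap A or B. */
static inline void compat_zomatadd_nn(size_t rows, size_t cols,
                                      double complex alpha,
                                      const double complex *a,
                                      double complex beta,
                                      const double complex *b,
                                      double complex *c)
{
    size_t n = rows * cols;

    /* copy + scal step: C = alpha * A */
    for (size_t i = 0; i < n; i++)
        c[i] = alpha * a[i];

    /* axpy step: C += beta * B */
    for (size_t i = 0; i < n; i++)
        c[i] += beta * b[i];
}
```

Transposed or strided operands would need per-row or per-column calls with explicit increments instead of a single pass over the whole buffer.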