Andrew
I will do it tomorrow. Regarding my alignment theory: @fenrus75, there is no alignment/chunk preference signaling in most cases; say, an L1 thread knows only the number of elements in the vector,...
> When input size is small, OpenBLAS still uses all available cores, resulting in too many syscalls @jeremiedbb could you try in common.h ``` #ifndef YIELDING // #define YIELDING sched_yield()...
I think it is around the places where the now-retired sched_compat_yield sysctl used to operate; now we are stuck in the world with the non-compat one. I think pthread_cond_wait is the pthread equivalent not...
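To make the blocking alternative concrete, here is a minimal sketch of a worker that sleeps in pthread_cond_wait until work arrives, rather than polling with sched_yield(). The names `work_slot`, `worker_wait` and `run_blocking_demo` are made up for this illustration and are not taken from OpenBLAS; this is not how OpenBLAS itself schedules its threads.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state for this sketch; not an OpenBLAS structure. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready_cv;
    bool            ready;
    int             payload;
} work_slot;

/* The worker blocks in the kernel until signaled, instead of spinning on sched_yield(). */
static void *worker_wait(void *arg) {
    work_slot *slot = arg;
    pthread_mutex_lock(&slot->lock);
    while (!slot->ready) {             /* the loop guards against spurious wakeups */
        pthread_cond_wait(&slot->ready_cv, &slot->lock);
    }
    slot->payload *= 2;                /* stand-in for the real work */
    pthread_mutex_unlock(&slot->lock);
    return NULL;
}

/* Hands one value to the worker and returns the doubled result (or -1 on failure). */
int run_blocking_demo(int value) {
    work_slot slot;
    pthread_t tid;

    pthread_mutex_init(&slot.lock, NULL);
    pthread_cond_init(&slot.ready_cv, NULL);
    slot.ready = false;
    slot.payload = 0;

    if (pthread_create(&tid, NULL, worker_wait, &slot) != 0) {
        pthread_cond_destroy(&slot.ready_cv);
        pthread_mutex_destroy(&slot.lock);
        return -1;
    }

    pthread_mutex_lock(&slot.lock);
    slot.payload = value;
    slot.ready = true;
    pthread_cond_signal(&slot.ready_cv);   /* wake the sleeping worker */
    pthread_mutex_unlock(&slot.lock);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&slot.ready_cv);
    pthread_mutex_destroy(&slot.lock);
    return slot.payload;
}
```

While waiting, the worker makes no repeated syscalls. The cost is wake-up latency, which may matter for very short kernels.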
It used to be in BLAS... It is still in LAPACK 3.1.1. It will take 3/2 the execution time of DOT (3 memory accesses per multiplication instead of 2 for DOT), and...
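The 3-versus-2 count is easiest to see with the two plain loops side by side. This is only a reference sketch; `dot_ref` and `had_ref` are names made up here, not BLAS or LAPACK routines.

```c
#include <stddef.h>

/* DOT: two loads (a[i], b[i]) per multiplication; the running sum can stay in a register. */
double dot_ref(size_t n, const double *a, const double *b) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* HAD (element-wise product): two loads plus one store (c[i]) per multiplication. */
void had_ref(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
}
```

If both loops are memory-bound, the extra store per element is where the 3/2 ratio would come from.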
Now imagine your sbmv multiplying 2 million-element vectors ... There is a less wasteful way with _gemv or _gemm (i.e. free dimension(s) == 1) (an order of magnitude slower than a loop) dot is...
Diagonal? It is not square.
Swap the dimensions of v2 and get HAD; swap the dimensions of v1 and get DOT.
In the common case one can treat matrices as 1:(M*N) vectors and apply the edge case of gemm / gemv.
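A minimal sketch of the DOT direction of that idea, assuming row-major storage: two M x N matrices are viewed as flat vectors of length M*N and multiplied as (1 x K) times (K x 1). `gemm_ref` is a naive stand-in written for this example, not a BLAS call, and `flat_dot_via_gemm` is likewise a made-up name. The sketch does not attempt the HAD case.

```c
#include <stddef.h>

/* Naive row-major C = A * B, where A is m x k and B is k x n.
   Hypothetical stand-in for a real gemm; no alpha/beta, no transposes. */
void gemm_ref(size_t m, size_t n, size_t k,
              const double *A, const double *B, double *C) {
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

/* Treat two rows x cols matrices as flat vectors and take their inner
   product as the degenerate gemm with m = 1, n = 1, k = rows * cols. */
double flat_dot_via_gemm(size_t rows, size_t cols,
                         const double *X, const double *Y) {
    double out = 0.0;
    gemm_ref(1, 1, rows * cols, X, Y, &out);
    return out;
}
```

With both free dimensions equal to 1, the triple loop collapses to the single DOT loop. A real gemm would still pay its dispatch and blocking overhead.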
FUNCTION DHAD2(N,A,B,C) DGEMV('N',N,1,1.0,...
You can accelerate parts of Eigen using BLAS (OpenBLAS, MKL, Accelerate Framework): https://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html The "this" function in MKL is not part of the BLAS functions but of another group, that is...