Andrew

Results: 824 comments by Andrew

I will do it tomorrow. Regarding my alignment theory: @fenrus75, there is no alignment/chunk preference signaling in most cases; say, an L1 thread knows only the number of elements in the vector,...

> When input size is small, OpenBLAS still uses all available cores, resulting in too many syscalls

@jeremiedbb could you try in common.h

```
#ifndef YIELDING
// #define YIELDING sched_yield()...
```
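For context, here is a minimal sketch of how an overridable yield macro of this kind can drive a spin-wait loop. Only the `YIELDING` name and `sched_yield()` come from the comment; `spin_until_set` and its spin limit are hypothetical names made up for illustration, and this is not claimed to be OpenBLAS's actual worker loop.

```c
#include <sched.h>      /* sched_yield (POSIX) */
#include <stdatomic.h>  /* atomic_int, atomic_load (C11) */

/* Overridable at compile time; the default is the definition quoted in the comment. */
#ifndef YIELDING
#define YIELDING sched_yield()
#endif

/* Hypothetical helper: poll *flag until it becomes nonzero or max_spins
   polls have failed, yielding the CPU between polls.
   Returns the number of yields performed. */
long spin_until_set(atomic_int *flag, long max_spins)
{
    long spins = 0;
    while (atomic_load(flag) == 0 && spins < max_spins) {
        YIELDING;   /* with the default definition, one syscall per iteration */
        ++spins;
    }
    return spins;
}
```

With the default definition every failed poll costs a `sched_yield()` call, which is one way many idle threads can add up to a lot of syscalls on small inputs. Redefining `YIELDING` changes that cost without touching the loop.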

I think it is around the places where the now-retired sched_compat_yield sysctl used to operate; now we are stuck in the world with the non-compat one. I think pthread_cond_wait is the pthread equivalent, not...

It used to be in BLAS... It is still in LAPACK 3.1.1. It will take 3/2 the execution time of DOT (3 memory accesses per multiplication in place of 2 for dot), and...
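The 3-versus-2 count can be read off two plain loops. This sketch assumes the operation being compared with DOT is an element-wise (Hadamard) product, which the truncated comment does not state outright. `dot_ref` and `had_ref` are hypothetical reference names, not BLAS or LAPACK routines.

```c
#include <stddef.h>

/* Dot product: per multiplication, 2 loads (x[i], y[i]);
   the running sum can stay in a register. */
double dot_ref(size_t n, const double *x, const double *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* Element-wise product: per multiplication, 2 loads (x[i], y[i])
   plus 1 store (z[i]), i.e. 3 memory accesses. */
void had_ref(size_t n, const double *x, const double *y, double *z)
{
    for (size_t i = 0; i < n; ++i)
        z[i] = x[i] * y[i];
}
```

For vectors too large for cache, both loops are limited by memory traffic, so the ratio of accesses per multiplication (3 to 2) is a reasonable first estimate of the ratio of run times.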

Now imagine your sbmv multiplying 2 million-element vectors... There is a less wasteful way with _gemv or _gemm (i.e. with the free dimension(s) == 1), though it is an order of magnitude slower than a loop. dot is...

Diagonal? It is not square.

Swap the dimensions of v2 and you get HAD; swap the dimensions of v1 and you get DOT.

In the common case one can treat matrices as 1:(M*N) vectors and apply the marginal (edge) case of gemm / gemv.
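A rough sketch of the flattening idea, assuming both matrices are stored contiguously with no padding (leading dimension equal to the row count). `ref_gemv` is a hypothetical stand-in for a column-major, no-transpose GEMV so the example does not need a BLAS library. `flat_inner_product` is likewise a made-up name.

```c
#include <stddef.h>

/* Hypothetical reference GEMV (column-major, no transpose, unit strides):
   y = alpha * A * x + beta * y, where A is rows-by-cols with leading dimension lda. */
void ref_gemv(size_t rows, size_t cols, double alpha, const double *a, size_t lda,
              const double *x, double beta, double *y)
{
    for (size_t i = 0; i < rows; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < cols; ++j)
            acc += a[i + j * lda] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}

/* View two contiguous m-by-n matrices as vectors of length m*n and pass them
   to the 1-row case of GEMV, which reduces to a single dot product. */
double flat_inner_product(size_t m, size_t n, const double *a, const double *b)
{
    double result = 0.0;
    ref_gemv(1, m * n, 1.0, a, 1, b, 0.0, &result);
    return result;
}
```

If a matrix has padding between columns (leading dimension larger than M), the flat view would also sweep over the padding, so this shortcut only applies to tightly packed storage.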

FUNCTION DHAD2(N,A,B,C) DGEMV('N',N,1,1.0,...

You can accelerate parts of Eigen using BLAS (OpenBLAS, MKL, Accelerate Framework): https://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html. The "this" function in MKL is not part of the BLAS functions but of another group, that is...