OpenBLAS
Hadamard product?
Elementwise vector product (like dot, but without the reduction to one number). Does it exist?
In MKL it's vdMul, in cuBLAS it's DHAD (so, not part of the standard BLAS function list)...
Hadamard product is completely central in machine learning, and is the only reason which prevents me from using OpenBLAS...
Does not exist currently, but sounds interesting. Would you happen to have a rough number for the speedup from using MKL compared to just letting a modern compiler optimize a naive implementation?
It used to be in BLAS... it still appears in LAPACK 3.1.1. It will take about 3/2 the execution time of DOT (3 memory accesses per multiplication in place of 2 for dot), and _dot is plain C anyway. A C macro would probably perform much better than a function call...
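A minimal sketch of that macro idea (DHAD is a made-up name here, not part of any BLAS API): expanding the loop inline at the call site avoids the function-call overhead for small vectors.

```c
/* Elementwise product z[i] = x[i] * y[i], expanded inline at the call
   site so there is no library-call overhead for small n. */
#define DHAD(n, x, y, z)                                  \
    do {                                                  \
        for (int dhad_i = 0; dhad_i < (n); dhad_i++)      \
            (z)[dhad_i] = (x)[dhad_i] * (y)[dhad_i];      \
    } while (0)
```

gcc or clang at -O2/-O3 will typically vectorize this loop on their own, which is the point about the naive version being hard to beat.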
@brada4 the only trace of this function appears to be a spurious entry in the intro_blas1 manpage distributed with LAPACK 3.1.1
Just learned there is a workaround using sbmv: http://stackoverflow.com/questions/7621520/element-wise-vector-vector-multiplication-in-blas
An elementwise matrix*matrix product would still be nice...
Not sure abusing sbmv for this would be more efficient than a simple loop over all elements? BTW, the only references for *HAD functions in cuBLAS appear to be people asking if there are any...
Now imagine your sbmv multiplying two million-element vectors... There is a less wasteful way with _gemv or _gemm (i.e. with the free dimension(s) == 1), though still an order of magnitude slower than a loop; dot is a simple loop. I say macro because for small samples a library call would be an enormous waste of time.
a less wasteful way with _gemv or _gemm (i.e. with the free dimension(s) == 1)
Please explain... Do you extract the diagonal from the resulting matrix somehow? How is it less wasteful?
Diagonal? It is not square.
Not sure I get the gemm/gemv idea either, but it could be instructive to see a quick benchmark comparison of that sbmv suggestion from stackoverflow against a simple loop that the compiler can unroll as needed. As far as I could find out, the *HAD functions were a brainchild of Cray/SGI in the mid-1990s (proposed for future inclusion in LAPACK at some netlib developer meeting in 1995) and part of their SCSL library, but never universally supported.
The only way I see with gemv/gemm is to multiply two vectors v1T * v2 (of the same size), which returns a square matrix, and the diagonal is actually the Hadamard product. I spent the whole weekend looking for some trick with lda/incx, but I'm too dumb to find anything. Maybe there is one? :)
sbmv allows the input matrix to be stored in banded matrix form. With the number of super-diagonals = 0 you only need to supply the diagonal, which in our case is one of the vectors, so it's not so bad...
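A minimal sketch of that banded trick in C, using the cblas interface (had_via_sbmv and had_loop are illustrative names, and the loop is included as the baseline the benchmark suggested above would compare against):

```c
#include <cblas.h>

/* v1 is passed as a banded matrix with k = 0 super-diagonals, i.e. just
   its diagonal, so dsbmv computes y := 1.0 * diag(v1) * v2 + 0.0 * y,
   which is the elementwise product of v1 and v2. */
void had_via_sbmv(int n, const double *v1, const double *v2, double *y)
{
    cblas_dsbmv(CblasColMajor, CblasUpper,
                n, 0,        /* order n, k = 0: only the main diagonal  */
                1.0, v1, 1,  /* alpha, "banded" A = v1, lda = 1         */
                v2, 1,       /* x = v2                                  */
                0.0, y, 1);  /* beta = 0 overwrites y with the result   */
}

/* The plain loop it should be benchmarked against. */
void had_loop(int n, const double *v1, const double *v2, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = v1[i] * v2[i];
}
```

Whether the sbmv detour ever beats the plain loop is exactly what a benchmark would show; the loop has no banded-storage bookkeeping to do.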
I should explain the context of my problem: I'm on a work computer where I do not have admin rights and cannot install or compile anything. I use Excel VBA to declare function calls to libopenblas.dll in order to do some matrix crunching. (When I write that Hadamard is "the only reason which prevents me from using OpenBLAS" -- I'm bluffing. mkl_rt.dll does not allow stdcall declarations from VBA, or maybe there is another reason; in any case I can only use OpenBLAS.) So, my two options are sbmv or a ~~C~~ ~~Macro~~ VBA macro for Hadamard, and ideally I would want to have it outside of VBA...
Swap dimensions of v2 and get HAD; swap dimensions of v1 and get DOT.
The Hadamard product is obviously a memory-bound operation, so the only thing one has to take care of is that the matrices are not traversed orthogonally to their storage scheme. If the matrices are accessed in the right way, one has 3 loads, 1 multiplication and 1 store per element. The only point I see where one can optimize something is if we assume $C := \alpha A \circ B + \beta C$ with $\beta = 0$ and use non-temporal stores for C. In this case we only have 2 loads and one store per multiplication, but it requires aligned access to C, and C must be larger than the L3 cache of current Intel CPUs to give good results.
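A hedged sketch of that non-temporal-store idea with AVX intrinsics (assumes x86 with AVX, n divisible by 4, and 32-byte-aligned c; the function name is illustrative):

```c
#include <immintrin.h>

/* C := A .* B with non-temporal stores: 2 loads and 1 store per element,
   writing C past the cache. Only pays off when C is larger than the LLC. */
void dhad_stream(long n, const double *a, const double *b, double *c)
{
    for (long i = 0; i < n; i += 4) {                   /* assumes n % 4 == 0 */
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_stream_pd(c + i, _mm256_mul_pd(va, vb)); /* c must be 32-byte aligned */
    }
    _mm_sfence();                                       /* order the streamed stores */
}
```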
In the common case one can treat the matrices as 1:(M*N) vectors and apply a marginal case of gemm/gemv.
I still don't see it. Please give us a concrete example. Assume size(v1) = size(v2).
FUNCTION DHAD2(N,A,B,C) DGEMV('N',N,1,1.0,...
dgemv "N", n, 1, 1.0, v1, n, v2, 1, 1.0, res, 1 This returns v1 multiplied by the first element in v2...
I think we can do Hadamard products using xtbmv with a diagonal matrix A, as follows:
cblas_dtbmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit, n, 0, p, 1, q, 1);
This does the Hadamard product of vectors p and q, whose lengths are n, writing the result to q (in place).
Sorry, I recommended using xtbmv for the Hadamard product.
However, according to tbmv_thread.c, using dtbmv for the Hadamard
product is inefficient (it calls xaxpy for each element?), so I recommend
writing your own Hadamard product routine in C.
Furthermore, the optimized kernels on some platforms have mysterious
bugs, and I have no idea how to solve the problem, because it only happens sometimes...
@personalityson certainly an unexpected but interesting use case. I do now understand the merit of having the *had functions in OpenBLAS even if just for "convenience" without any particular speed benefit. Coding them as simple multiplication loops is very likely to be more efficient than "abusing" a special case of some more general function. As regards the mysterious bugs in optimized code, I trust the common goal is to get rid of them :-)
Btw, the stdcall requirement normally needed for DLLs to work in VBA only applies to 32-bit DLLs. I saw an old feature request from 2014 also related to VBA: https://github.com/xianyi/OpenBLAS/issues/423 In any case, 64-bit binaries work fine in the 64-bit version of Office.
Looking at the declaration of xHAD, which has alpha and beta multipliers, incx, incy and even incz for a strided result vector, sbmv/tbmv actually could be a good choice. The MKL-style vxMul, on the other hand, appears to support only the simple multiplication of N elements of vectors x and y to yield vector z.
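For reference, a hedged sketch of the semantics implied by that declaration (the exact historical Cray/SGI signature is an assumption, and non-negative strides are assumed for simplicity):

```c
/* z := alpha * (x .* y) + beta * z, with arbitrary strides, matching the
   alpha/beta/incx/incy/incz parameters described above. Illustrative only. */
void dhad(int n, double alpha, const double *x, int incx,
          const double *y, int incy, double beta, double *z, int incz)
{
    for (int i = 0; i < n; i++)
        z[i * incz] = alpha * x[i * incx] * y[i * incy] + beta * z[i * incz];
}
```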
Hello everybody, I am looking forward to this function too! My program does a video tracking task like this OpenCV tracking algorithm. The algorithm does a lot of pointwise complex products in the Fourier domain. The implementation in OpenCV is cv::mulSpectrums(), and it is the bottleneck. My program will run on multiple platforms such as x86 and ARM. OpenBLAS is my first choice for cross-platform support and high performance. However, it is a pity that it lacks a pointwise vector product. Eigen and MKL do have this function.
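For that Fourier-domain use case the operation boils down to a pointwise complex product; a minimal sketch in C99 (cmul_spectra is an illustrative name, not an OpenCV or BLAS API, and the conjugation option of cv::mulSpectrums is omitted):

```c
#include <complex.h>

/* Pointwise complex product, the core of spectrum multiplication:
   c[i] = a[i] * b[i] for complex doubles. */
void cmul_spectra(int n, const double complex *a,
                  const double complex *b, double complex *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
```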
You can accelerate parts of Eigen using BLAS (OpenBLAS, MKL, Accelerate Framework): https://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html The function in question in MKL is not part of the BLAS functions, but of another group, which is not provided in OpenBLAS.
Hi folks, I don't have anything to add other than to second @loadwiki; this would be a useful addition.
It is in all viable macro libraries, and it is memory-bound, so there won't be any improvement in speed from adding it here over generic BLAS. I'd bet on modern compilers actually getting a totally naive loop reasonably well vectorised.
there won't be any improvement in speed
err, not sure I'd want to underwrite that...
A non-destructive AXPY, that is, 3 memory accesses per MUL in place of the original 2...