
Hadamard product?

personalityson opened this issue 8 years ago • 27 comments

Elementwise vector product (like dot, but without the reduction to one number). Does it exist?

In MKL it's vdMul, in cuBLAS it's DHAD (so, not part of the standard BLAS function list)...

The Hadamard product is completely central in machine learning, and its absence is the only thing preventing me from using OpenBLAS...

personalityson avatar Feb 04 '17 18:02 personalityson

Does not exist currently, but sounds interesting. Would you happen to have some rough numbers for the speedup from using MKL compared to just letting a modern compiler optimize a naive implementation?

martin-frbg avatar Feb 04 '17 19:02 martin-frbg

It used to be in BLAS... it is still mentioned in LAPACK 3.1.1. It would take 3/2 the execution time of DOT (3 memory accesses per multiplication in place of 2 for dot), and _dot is plain C anyway. A C macro would probably perform much better than a function call...
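
As a sketch of what such a macro could look like (the DHAD name is made up), expanding inline at the call site with no call overhead:

```c
/* Hypothetical Hadamard-product macro: expands inline at the call site,
   so there is no function-call overhead for small vectors. */
#define DHAD(n, x, y, z)                          \
    do {                                          \
        for (int _i = 0; _i < (n); _i++)          \
            (z)[_i] = (x)[_i] * (y)[_i];          \
    } while (0)
```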

brada4 avatar Feb 04 '17 20:02 brada4

@brada4 the only trace of this function appears to be a spurious entry in the intro_blas1 manpage distributed with lapack 3.1.1

martin-frbg avatar Feb 04 '17 22:02 martin-frbg

Just learned there is a workaround using sbmv: http://stackoverflow.com/questions/7621520/element-wise-vector-vector-multiplication-in-blas

Elementwise matrix*matrix product would still be nice...

personalityson avatar Feb 04 '17 22:02 personalityson

Not sure abusing sbmv for this would be more efficient than a simple loop over all elements? BTW the only references to *HAD functions in cuBLAS appear to be people asking if there are any...

martin-frbg avatar Feb 04 '17 23:02 martin-frbg

Now imagine your sbmv multiplying two million-element vectors... There is a less wasteful way with _gemv or _gemm (i.e. with the free dimension(s) == 1), though still an order of magnitude slower than a loop; dot is a simple loop. I say macro because for small samples a library call would be an enormous waste of time.

brada4 avatar Feb 05 '17 00:02 brada4

less wasteful way with _gemv or _gemm (i.e free dimension(s) ==1)

Please explain... Do you extract the diagonal from the resulting matrix somehow? How is it less wasteful?

personalityson avatar Feb 05 '17 20:02 personalityson

Diagonal? It is not square.

brada4 avatar Feb 06 '17 01:02 brada4

Not sure I get the gemm/gemv idea either, but it could be instructive to see a quick benchmark comparison of that sbmv suggestion from stackoverflow against a simple loop that the compiler can unroll as needed. As far as I could find out, the *HAD functions were a brainchild of Cray/SGI in the mid-1990s (proposed for future inclusion in LAPACK at some netlib developer meeting in 1995) and part of their SCSF library but never universally supported.

martin-frbg avatar Feb 06 '17 08:02 martin-frbg

The only way I see with gemv/gemm is to multiply the two vectors as v1ᵀ * v2 (the outer product of two same-size vectors), which returns a square matrix, and the diagonal is actually the Hadamard product. I spent the whole weekend looking for some trick with lda/incx, but I'm too dumb to find anything. Maybe there is one? :)
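
For illustration, this is roughly what that outer-product route looks like in CBLAS terms; it is deliberately wasteful, spending O(n²) work and memory on an O(n) result (dhad_outer is a made-up name):

```c
#include <cblas.h>
#include <stdlib.h>

/* z := diag(v1 * v2^T): form the full outer product with dger,
   then read off its diagonal. O(n^2) work for an O(n) result. */
void dhad_outer(int n, const double *v1, const double *v2, double *z)
{
    double *M = calloc((size_t)n * n, sizeof *M);
    cblas_dger(CblasColMajor, n, n, 1.0, v1, 1, v2, 1, M, n); /* M += v1*v2^T */
    for (int i = 0; i < n; i++)
        z[i] = M[i + (size_t)i * n];  /* diagonal element (i,i) */
    free(M);
}
```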

sbmv allows the input matrix to be stored in banded matrix form. With number of super-diagonals = 0 you only need to supply the diagonal, which in our case is one of the vectors, so it's not so bad...
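
A minimal sketch of the sbmv trick, assuming the standard CBLAS interface shipped with OpenBLAS (the dhad_sbmv wrapper name is made up):

```c
#include <cblas.h>

/* z := x .* y: with k = 0 super-diagonals, the band matrix degenerates
   to just its diagonal (x), so "matrix" times vector is elementwise. */
void dhad_sbmv(int n, const double *x, const double *y, double *z)
{
    /* lda = k + 1 = 1: each column of the band storage holds one entry */
    cblas_dsbmv(CblasColMajor, CblasUpper, n, 0, 1.0, x, 1, y, 1, 0.0, z, 1);
}
```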

I should explain the context of my problem: I'm on a work computer where I do not have admin rights and cannot install or compile anything. I use Excel VBA to declare function calls to libopenblas.dll in order to do some matrix crunching. (When I write that Hadamard is "the only reason which prevents me from using OpenBLAS" -- I'm bluffing. mkl_rt.dll does not allow stdcall declarations from VBA, or maybe there is another reason; in any case I can only use OpenBLAS.) So my two options are sbmv or a ~~C~~ ~~Macro~~ VBA macro for hadamard, and ideally I would want to have it outside of VBA...

personalityson avatar Feb 06 '17 08:02 personalityson

Swap dimensions of v2 and get HAD; swap dimensions of v1 and get DOT.

brada4 avatar Feb 06 '17 09:02 brada4

The Hadamard product is obviously a memory-bound operation, so the only thing one has to take care of is that the matrices are not traversed orthogonally to their storage scheme. If the matrices are accessed in the right way one has 3 loads, 1 multiplication and 1 store per element. The only point where I see room for optimization is if we assume $C := \alpha A \circ B + \beta C$ with $\beta = 0$ and use non-temporal stores for C. In this case we only have 2 loads and 1 store per multiplication, but it requires aligned access to C, and C must be larger than the L3 cache of current Intel CPUs to give good results.
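
A minimal sketch of that $\beta = 0$ case with non-temporal stores, assuming x86-64 with SSE2 and a 16-byte-aligned c (the function name is made up):

```c
#include <emmintrin.h>

/* c := a .* b using streaming stores: per element, 2 loads and 1 store
   that bypasses the cache, instead of 3 loads and 1 cached store. */
void dhad_stream(long n, const double *a, const double *b, double *c)
{
    long i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_stream_pd(c + i, _mm_mul_pd(va, vb)); /* c must be 16-byte aligned */
    }
    for (; i < n; i++)   /* scalar tail */
        c[i] = a[i] * b[i];
    _mm_sfence();        /* make the streaming stores globally visible */
}
```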

grisuthedragon avatar Feb 07 '17 07:02 grisuthedragon

In the common case one can treat the matrices as 1:(M*N) vectors and apply the marginal case of gemm/gemv.

brada4 avatar Feb 07 '17 11:02 brada4

I still don't see it. Please give us a concrete example. Assume size(v1) = size(v2).

personalityson avatar Feb 07 '17 18:02 personalityson

FUNCTION DHAD2(N,A,B,C) DGEMV('N',N,1,1.0,...

brada4 avatar Feb 07 '17 20:02 brada4

dgemv "N", n, 1, 1.0, v1, n, v2, 1, 1.0, res, 1 This returns v1 multiplied by the first element in v2...

personalityson avatar Feb 07 '17 21:02 personalityson

I think we can do hadamard products using xtbmv with diagonal matrix A as the following:

```c
cblas_dtbmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit, n, 0, p, 1, q, 1);
```

This code computes the Hadamard product of the vectors p and q, both of length n, and writes the result to q.

hiro4bbh avatar Feb 09 '17 09:02 hiro4bbh

Sorry, I recommended using xtbmv for the Hadamard product above. However, according to tbmv_thread.c, using dtbmv for the Hadamard product is inefficient (it calls xaxpy for each element?), so I recommend writing your own Hadamard product routine in C. Furthermore, the optimized code on some platforms has mysterious bugs, and I have no idea how to solve them, because they only happen sometimes...
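
Such a hand-rolled routine can be as simple as the following sketch; the restrict qualifiers tell the compiler the arrays do not alias, so it is free to vectorize the loop:

```c
/* Plain Hadamard product: a naive loop the compiler can auto-vectorize. */
void dhad(int n, const double *restrict x, const double *restrict y,
          double *restrict z)
{
    for (int i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}
```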

hiro4bbh avatar Feb 09 '17 12:02 hiro4bbh

@personalityson certainly an unexpected but interesting use case. I do now understand the merit of having the *had functions in OpenBLAS even if just for "convenience" without any particular speed benefit. Coding them as simple multiplication loops is very likely to be more efficient than "abusing" a special case of some more general function. As regards the mysterious bugs in optimized code, I trust the common goal is to get rid of them :-)

martin-frbg avatar Feb 09 '17 13:02 martin-frbg

Btw, the stdcall requirement normally needed for DLLs to work in VBA only applies to 32-bit DLLs. I saw an old feature request from 2014 also related to VBA: https://github.com/xianyi/OpenBLAS/issues/423 In any case, 64-bit binaries work fine in the 64-bit version of Office.

personalityson avatar Feb 09 '17 15:02 personalityson

Looking at the declaration of xHAD, which has alpha and beta multipliers, incx, incy and even incz for a sparse result vector, sbmv/tbmv actually could be a good choice. The MKL-style vxMul, on the other hand, appears to support only the simple multiplication of N elements of vectors x and y to yield vector z.
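
For reference, the described interface would behave roughly like this sketch (the name, argument order, and positive-increment assumption are guesses, not the actual Cray declaration):

```c
/* Hypothetical xHAD semantics: z := alpha * x .* y + beta * z,
   with separate strides for all three vectors (positive increments assumed). */
void dhad_general(int n, double alpha, const double *x, int incx,
                  const double *y, int incy,
                  double beta, double *z, int incz)
{
    for (int i = 0; i < n; i++)
        z[i * incz] = alpha * x[i * incx] * y[i * incy] + beta * z[i * incz];
}
```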

martin-frbg avatar Feb 09 '17 22:02 martin-frbg

Hello everybody, I am looking forward to this function too! My program does a video tracking task like this OpenCV tracking algorithm. The algorithm does a lot of pointwise complex products in the Fourier domain. The implementation in OpenCV is cv::mulSpectrums(), and it is the bottleneck. My program will run on multiple platforms such as x86 and ARM. OpenBLAS is my first choice for its cross-platform support and high performance. However, it is a pity that it lacks a pointwise vector product. Eigen and MKL do have this function.

loadwiki avatar Apr 26 '18 05:04 loadwiki

You can accelerate parts of Eigen using BLAS (OpenBLAS, MKL, the Accelerate framework): https://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html The function in question in MKL is not part of the BLAS functions but of another group, which is not provided in OpenBLAS.

brada4 avatar Apr 27 '18 11:04 brada4

Hi folks, I don't have anything to add other than to second @loadwiki , this would be a useful addition.

MaxBarraclough avatar Mar 24 '21 11:03 MaxBarraclough

It is in all viable macro libraries, and it is memory-bound, so there won't be any improvement in speed from adding it here over generic BLAS. I'd bet on modern compilers actually getting a totally naive loop reasonably well vectorised.

brada4 avatar Mar 24 '21 19:03 brada4

there won't be any improvement in speed

err, not sure I'd want to underwrite that...

martin-frbg avatar Mar 25 '21 07:03 martin-frbg

A non-destructive AXPY, that is, 3 memory accesses per MUL in place of the original 2...

brada4 avatar Mar 25 '21 17:03 brada4