Multiplication of tensors with rank > 2
Hello!
Currently, the `*` operator only supports matrix-matrix and matrix-vector multiplication. Are there any plans to add support for batched matrix-matrix multiplication? It would be really useful for things like attention, which I am trying to implement.
Thanks!
I'm interested in this as well.
I will be the first to submit a PR for GPT in Arraymancer if all the pieces are available to build attention.
@lucidrains I actually requested this because I had started writing a PR to add attention in Arraymancer myself 😅. However, I am a big fan of your work, and it would be great if you had the possibility to add attention yourself.
Regarding this issue, I might have time to implement it; however, I have little knowledge of the possible implementations.
Sorry, I was off for a couple of months and didn't check my long list of GitHub notifications.
Batch matrix multiplication is something I've wanted to implement for about 4 years now. My main issue is that some libraries provide it:
- Intel MKL via https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-batch-gemm-operations.html
- Nvidia CuBLAS via https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
but not OpenBLAS or ~~BLIS~~:
- actually, gemm_batch was added by AMD to BLIS in April 2022 (https://github.com/flame/blis/blob/89b7863/docs/ReleaseNotes.md?plain=1#L43-L74)
- https://github.com/xianyi/OpenBLAS/issues/3039 still open
It's easy to add a naive version that for-loops over matrix multiplication, but because all the BLAS libraries use OpenMP, and OpenMP doesn't support nesting properly, you can't exploit this newly exposed level of parallelism at all. Which:
- Led me to start designing a DSL for tensor operations: https://github.com/mratsim/laser/tree/master/laser/lux_compiler/core (note: @can-lehmann, in a separate effort, went all the way to working code with https://github.com/can-lehmann/exprgrad)
- Led me to design a multithreading backend, https://github.com/mratsim/weave
- Led me to design a matrix multiplication, based on Weave, that is faster than OpenBLAS: https://github.com/mratsim/weave/tree/master/benchmarks/matmul_gemm_blas, where the extra parallelism level for batch matmul would be easy to add.
Which brings us to engineering issues. For now Arraymancer doesn't use a custom threadpool because it's a very involved change, and besides matrix multiplication I would also need to port some LAPACK functions, namely the ones here: https://github.com/mratsim/Arraymancer/tree/master/src/arraymancer/linear_algebra/helpers
- syevr (Symmetric Eigenvalue decomposition via Relatively Robust Representations)
- geqrf (General QR factorization)
- gesdd (General Singular Value Decomposition by divide-and-conquer)
- getrf (General Pivoted LU factorization)
- gelsd (General Least Squares solver by divide-and-conquer)
- gesv (General AX = B solver)
So batch matrix multiplication would be very welcome, but probably start humbly with a for loop over regular matrix multiplication.
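For reference, here is a minimal sketch of what that humble version could look like: a plain loop over the batch axis that reuses the existing rank-2 `*`. The `batched_matmul` name is hypothetical (not an existing Arraymancer proc), and the sketch assumes the usual `[batch, M, K] x [batch, K, N]` layout plus the documented slicing behaviour where an integer index mixed with `_` keeps a singleton dimension.

```nim
import arraymancer

proc batched_matmul[T: SomeFloat](a, b: Tensor[T]): Tensor[T] =
  ## Naive batched matmul sketch: a is [batch, M, K], b is [batch, K, N],
  ## the result is [batch, M, N]. Each slice goes through the BLAS-backed `*`.
  doAssert a.rank == 3 and b.rank == 3
  doAssert a.shape[0] == b.shape[0] and a.shape[2] == b.shape[1]
  result = newTensor[T](a.shape[0], a.shape[1], b.shape[2])
  for i in 0 ..< a.shape[0]:
    # a[i, _, _] keeps a leading singleton dimension, hence squeeze/unsqueeze.
    result[i, _, _] = (a[i, _, _].squeeze(0) * b[i, _, _].squeeze(0)).unsqueeze(0)

when isMainModule:
  let a = ones[float64](4, 2, 3)
  let b = ones[float64](4, 3, 5)
  echo batched_matmul(a, b).shape   # expected: a [4, 2, 5] result
```

The outer loop is exactly the extra level of parallelism discussed above: it stays sequential here, and each iteration delegates its threading to BLAS.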
I apologize in advance if my question is too basic.
Just out of curiosity, is there any workaround for this issue?
Let's say, for example, that I want to do something like L (double dot product) P = L_ijkl P_lk. Can I convert this operation to a loop over rank-2 matrices?
The einsum macro should work, but it would be non-parallelized and slow.
Otherwise, you can do it like the im2col convolution fallback:
https://github.com/mratsim/Arraymancer/blob/12610a3a29d61c2bccb8a24fd744634c9809406d/src/arraymancer/nn_primitives/fallback/conv.nim#L99-L103
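For the L_ijkl P_lk contraction above, the einsum route could look roughly like the sketch below. This is from memory of the einsum macro's explicit form (the left-hand side of the statement names the result's indices), so double-check it against the einsum module's documentation; the 3x3x3x3 shapes are arbitrary.

```nim
import arraymancer

# Sketch: R_ij = L_ijkl * P_lk (double dot product), written with einsum's
# explicit form. Indices k and l are contracted, i and j remain.
let L = ones[float64](3, 3, 3, 3)
let P = ones[float64](3, 3)

let R = einsum(L, P):
  R[i, j] = L[i, j, k, l] * P[l, k]

echo R.shape   # expected: a [3, 3] result
```

As far as I remember, the macro expands to plain nested loops, which is why it works but stays single-threaded.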