Multiplication of tensors with rank > 2
Hello!
Currently, the `*` operator only supports matrix-matrix and matrix-vector multiplication. Are there any plans to add support for batched matrix-matrix multiplication? It would be really useful for things like attention, which I am trying to implement.
Thanks!
I'm interested in this as well.
I will be the first to submit a PR for GPT in Arraymancer if all the pieces are available to build attention.
@lucidrains I actually requested this because I had started writing a PR to add attention in Arraymancer myself 😅. However, I am a big fan of your work, and it would be great if you had the possibility to add attention yourself.
Regarding this issue, I might have time to implement it; however, I have little knowledge of the possible implementations.
Sorry, I was off for a couple of months and didn't check my long list of GitHub notifications.
Batch matrix multiplication is something I've wanted to implement for about 4 years now. My main issue is that some libraries provide it:
- Intel MKL via https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-batch-gemm-operations.html
- Nvidia CuBLAS via https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
but not OpenBLAS or ~~BLIS~~:
- actually, gemm_batch was added by AMD to BLIS in April 2022 (https://github.com/flame/blis/blob/89b7863/docs/ReleaseNotes.md?plain=1#L43-L74)
- https://github.com/xianyi/OpenBLAS/issues/3039 still open
It's easy to add a naive version that for-loops over matrix multiplication, but because all the BLAS libraries use OpenMP, and OpenMP doesn't support nesting properly, you can't exploit this newly exposed level of parallelism at all. Which:
- Led me to start designing a DSL for tensor operations: https://github.com/mratsim/laser/tree/master/laser/lux_compiler/core (note: @can-lehmann, in a separate effort, went all the way to working code with https://github.com/can-lehmann/exprgrad)
- Led me to design a multithreading backend, https://github.com/mratsim/weave
- Led me to design a matrix multiplication, based on Weave, that is faster than OpenBLAS: https://github.com/mratsim/weave/tree/master/benchmarks/matmul_gemm_blas, where the extra parallelism level for batch matmul would be easy to add.
Which brings us to engineering issues. For now Arraymancer doesn't use a custom threadpool because it's a very involved change, and besides matrix multiplication I would also need to port some LAPACK functions, namely the ones here: https://github.com/mratsim/Arraymancer/tree/master/src/arraymancer/linear_algebra/helpers
- syevr (Symmetric Eigenvalue decomposition via Relatively Robust Representations)
- geqrf (General QR factorization)
- gesdd (General Singular Value Decomposition by divide-and-conquer)
- getrf (General Pivoted LU factorization)
- gelsd (General Least Squares solver by divide-and-conquer)
- gesv (General AX = B solver)
So batch matrix multiplication would be very welcome, but probably start humbly with a for loop over regular matrix multiplication.
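For reference, here is a minimal sketch of what that humble version could look like: a plain loop over the batch axis that reuses the existing rank-2 `*`. The `batched_matmul` name is hypothetical (not an existing Arraymancer proc), and the sketch assumes the usual `[batch, M, K] x [batch, K, N]` layout plus the documented slicing behaviour where an integer index mixed with `_` keeps a singleton dimension.

```nim
import arraymancer

proc batched_matmul[T: SomeFloat](a, b: Tensor[T]): Tensor[T] =
  ## Naive batched matmul sketch: a is [batch, M, K], b is [batch, K, N],
  ## the result is [batch, M, N]. Each slice goes through the BLAS-backed `*`.
  doAssert a.rank == 3 and b.rank == 3
  doAssert a.shape[0] == b.shape[0] and a.shape[2] == b.shape[1]
  result = newTensor[T](a.shape[0], a.shape[1], b.shape[2])
  for i in 0 ..< a.shape[0]:
    # a[i, _, _] keeps a leading singleton dimension, hence squeeze/unsqueeze.
    result[i, _, _] = (a[i, _, _].squeeze(0) * b[i, _, _].squeeze(0)).unsqueeze(0)

when isMainModule:
  let a = ones[float64](4, 2, 3)
  let b = ones[float64](4, 3, 5)
  echo batched_matmul(a, b).shape   # expected: a [4, 2, 5] result
```

The outer loop is exactly the extra level of parallelism discussed above: it stays sequential here, and each iteration delegates its threading to BLAS.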
I apologize in advance if my question is too basic.
Just out of curiosity, is there any workaround for this issue?
Let's say, for example, that I want to do something like L (double dot product) P = L_ijkl P_lk. Can I convert this operation to a loop over rank-2 matrices?
The einsum macro should work, but it would be non-parallelized and slow.
Otherwise, you can do it like the im2col convolution fallback:
https://github.com/mratsim/Arraymancer/blob/12610a3a29d61c2bccb8a24fd744634c9809406d/src/arraymancer/nn_primitives/fallback/conv.nim#L99-L103
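For the L_ijkl P_lk contraction above, the einsum route could look roughly like the sketch below. This is from memory of the einsum macro's explicit form (the left-hand side of the statement names the result's indices), so double-check it against the einsum module's documentation; the 3x3x3x3 shapes are arbitrary.

```nim
import arraymancer

# Sketch: R_ij = L_ijkl * P_lk (double dot product), written with einsum's
# explicit form. Indices k and l are contracted, i and j remain.
let L = ones[float64](3, 3, 3, 3)
let P = ones[float64](3, 3)

let R = einsum(L, P):
  R[i, j] = L[i, j, k, l] * P[l, k]

echo R.shape   # expected: a [3, 3] result
```

As far as I remember, the macro expands to plain nested loops, which is why it works but stays single-threaded.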