
replace matrixmultiply with gemm

Open sarah-quinones opened this issue 3 years ago • 7 comments

this PR replaces the matrixmultiply dependency with gemm, a crate i developed for high-performance matrix multiplication. its single-threaded performance is about 10-20% better than matrixmultiply for medium/large matrices, and its multithreaded performance is competitive with blis and intel mkl (and about 10% faster than eigen on my machine).

the multithreaded parallelism isn't implemented in this PR, since i'm not sure how to expose it in nalgebra (global variable, thread_local, per-call basis?)

gemm also has a nightly feature that enables avx512 for extra performance; that isn't wired up in this PR either, since i don't know how we want to expose it.
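for what it's worth, one way it could be exposed (just a sketch; the nalgebra-side feature name below is made up, only gemm's nightly feature is real) is a cargo feature in nalgebra that simply forwards to gemm:

# hypothetical Cargo.toml sketch for nalgebra; "nightly-avx512" is a made-up name
[features]
# forwards to gemm's nightly feature, which enables its avx512 code paths
nightly-avx512 = ["gemm/nightly"]

that would keep the default build on stable and leave the avx512 paths strictly opt-in.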

sarah-quinones commented Nov 06 '22

matmul benchmark results on my machine after enabling gemm:

mat100_mul_mat100       time:   [39.299 µs 39.346 µs 39.398 µs]                               
                        change: [-4.7112% -4.5478% -4.3846%] (p = 0.00 < 0.05)
                        Performance has improved.
mat500_mul_mat500       time:   [3.8708 ms 3.8769 ms 3.8843 ms]                               
                        change: [-12.788% -12.574% -12.344%] (p = 0.00 < 0.05)
                        Performance has improved.

sarah-quinones commented Nov 06 '22

the CI seems to be failing on cuda, but i'm not sure what's causing it. it says it can't find aligned-vec = "0.5", but it's right there on crates.io: https://crates.io/crates/aligned-vec

sarah-quinones commented Nov 07 '22

Thanks @sarah-ek, this is impressive work!

I'm personally cautiously positive about integrating gemm here, given the preliminary results. There are a number of considerations to weigh, though, and I'll outline a few of them here.

  • We only have two benchmarks for square matrices here. Could there be regressions for, say, non-square matrices or smaller matrices? I'm not sure how many benchmarks nalgebra already has for GEMM, but perhaps we need to extend the benchmark suite a little to account for more cases.
  • matrixmultiply has the benefit of being battle-tested through years of use in several linear algebra crates. gemm is, as far as I can tell, brand new. Given the highly unsafe nature of both crates, switching to gemm incurs no small amount of risk. Are there ways we can mitigate that risk?
  • IIRC, matrixmultiply has multiple authors, there's a published blog post on its general architecture/design (though I don't know how much things have deviated over time), and there are decent docs. This means that if there are issues related to matrix multiplication, we have more to work with. By comparison, gemm has very limited docs, and it's a bit of a mystery why it's faster than matrixmultiply. This is especially problematic if the author @sarah-ek ever disappears from the ecosystem (which, experience has shown, people tend to do).

I'm sure @sebcrozet might have additional thoughts on this.

the multithreaded parallelism isn't implemented in this PR, since i'm not sure how to expose it in nalgebra (global variable, thread_local, per-call basis?)

I believe very strongly that parallelism should be opt-in, on a case-by-case basis. In particular, given nalgebra's current API, I believe A * B should always be single-threaded. This runs contrary to what Eigen does - it parallelizes based on global decisions. This is in my opinion a grave mistake and a decision that I've run afoul of several times in the past.

Automatically parallelizing A * B may make your code 10 times faster. But it might also make it 100 times slower. Yes, that's a realistic number. The problem is inherently that the user might already be running A * B inside tasks that are already run in parallel. Trying to run another 10 threads inside each already running thread will cause nothing but trouble. In short, parallelism simply doesn't compose well in a black-box setting, and the decision should be left to the user, since it is not possible to know if parallelism is beneficial or not from the context of a single GEMM call.

This raises the question of what the API for parallel kernel calls - like GEMM - should look like in nalgebra.
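To make this concrete, here is a rough sketch of the kind of per-call, opt-in API I have in mind. None of these names exist in nalgebra today; the sketch is purely illustrative:

// Hypothetical API sketch -- none of these items exist in nalgebra today.
use nalgebra::DMatrix;

/// How much parallelism a single kernel call is allowed to use.
pub enum Parallelism {
    /// Run on the calling thread only (what `A * B` should always do).
    None,
    /// Allow up to `n` worker threads, for this call only.
    Threads(usize),
}

fn example(a: &DMatrix<f64>, b: &DMatrix<f64>) -> DMatrix<f64> {
    // The operator stays single-threaded and predictable.
    let c = a * b;

    // A hypothetical explicit entry point where the caller, who knows the
    // surrounding context, decides whether parallelism is beneficial:
    // let c = a.mul_with_parallelism(b, Parallelism::Threads(8));

    c
}

The important property is that nothing runs in parallel unless the user asks for it at the call site.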

Andlon commented Nov 07 '22

I forgot to mention, but having more contributors run the benchmarks on their computers would also be helpful, since these low-level kernels might be highly sensitive to the hardware, and I'm sure you've tested it most extensively on your own hardware @sarah-ek. I'm out of time right now but I'll try to do it in due time.

Andlon commented Nov 07 '22

so, gemm is based on the BLIS papers, which can be found here: https://github.com/flame/blis#citations. it's more or less a faithful implementation of the algorithms described in the fourth paper.

i'm aware the documentation is sparse at the moment, and i'm willing to put in the work to improve it. i have a more extensive benchmark suite in the gemm repository and can post the results from my machine. i can also make it easier for other people to benchmark the code. i'm also curious how it performs on machines other than my desktop and laptop (both x86_64), and i'm open to suggestions regarding that.

as for the unsafety of the code: i run a subset of my tests under miri to make sure there's no UB. i'm also willing to write more tests and make sure all the code paths are covered.

sarah-quinones commented Nov 07 '22

You have done an incredible job @sarah-ek, thank you for this PR!

I agree with @Andlon’s remarks.

Looking at its code base, we can’t use gemm as-is for such a core component of nalgebra. gemm needs to have more apparent documentation and tests. It also needs a CI that runs the tests. The Readme shouldn’t describe gemm as a "Playground"; this gives the feeling that it is more an experimental project than something ready for production.

I think the transition should be slower than just replacing matrixmultiply with gemm. At first, we need to keep matrixmultiply as the default backend and only enable gemm if the user asks for it explicitly. I suggest that we add an experimental-use-gemm-instead-of-matrixmultiply feature which swaps matrixmultiply out for gemm when enabled. You can see this as an easy way for the community to try it out and see whether there are any bugs or performance benefits in real-world applications.
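Roughly, the wiring for such a feature could look like this (just a sketch; the version shown is illustrative and the exact plumbing is up for discussion):

# Cargo.toml sketch: matrixmultiply stays the default, gemm is strictly opt-in
[dependencies]
gemm = { version = "0.5", optional = true }   # illustrative version

[features]
# enabling this pulls in gemm and routes the matmul kernel through it
experimental-use-gemm-instead-of-matrixmultiply = ["gemm"]

The matmul code path would then pick the backend behind a #[cfg(feature = "experimental-use-gemm-instead-of-matrixmultiply")] switch, so nothing changes for users who don't enable the feature.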

the CI seems to be failing on cuda, but i'm not sure what's causing it. it says it can't find aligned-vec = "0.5", but it's right there on crates.io: https://crates.io/crates/aligned-vec

I think this happens because the compiler version nightly-2021-12-04 (needed by rust-cuda) does not support the dep: prefix in serde = ["dep:serde"]. So cargo just skips version 0.5 of aligned-vec because it considers its Cargo.toml to be invalid. Unfortunately, I don’t know any solution that could make it work, other than removing the dep: prefix.
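Concretely, the relevant part of aligned-vec's Cargo.toml presumably looks something like this (the dependency declaration and version are assumptions on my part; only the serde = ["dep:serde"] feature line is from the error above):

[dependencies]
serde = { version = "1", optional = true }   # illustrative version

[features]
# the `dep:` prefix below is only understood by newer cargo; the old nightly
# required by rust-cuda rejects it and therefore skips this version of the crate
serde = ["dep:serde"]

Removing the dep: prefix would presumably amount to dropping this explicit feature and relying on the implicit serde feature that the optional dependency already provides, which the older toolchain can parse.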

sebcrozet commented Nov 07 '22

these are all great ideas. i'm not familiar with how CI works, but i can use this as a learning opportunity. having it as an experimental feature also sounds good to me. as for the dep: issue, i can take care of it.

sarah-quinones commented Nov 07 '22