
GEMM benchmark with CLBlast comparison

Open mratsim opened this issue 6 years ago • 3 comments

Hello team,

I read with interest the two papers you published on Futhark and the DNN implementation.

I'm currently researching autotuners for deep learning, and Futhark stands out with its introduction of Second-Order Array Combinators like scanomap and redomap.
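As a loose illustration of the combinator style referred to here, matrix multiplication can be sketched in Python with nested maps over a fused multiply-then-reduce inner pattern (all names below are hypothetical; Futhark's actual redomap fuses the map and reduction into a single compiled pass):

```python
from functools import reduce

def dot(xs, ys):
    # redomap-style inner product: map (*) over paired elements,
    # reduced with (+) in a single traversal
    return reduce(lambda acc, p: acc + p[0] * p[1], zip(xs, ys), 0)

def matmul(a, b):
    # nested maps: over the rows of a and the columns of b
    bt = list(zip(*b))  # transpose b so its columns are contiguous rows
    return [[dot(row, col) for col in bt] for row in a]
```

This is only a shape-of-the-computation sketch; the point of the SOAC formulation is that the compiler sees the map/reduce structure explicitly and can fuse and parallelise it.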

I'd like to understand how Futhark fares against state-of-the-art matrix multiplication, which should be able to reach 90%+ hardware utilisation and which stresses compilers, autotuners, and handwritten kernels alike with its need for cache-aware (or cache-oblivious) algorithms, tiling, and register-pressure management.

However, I cannot find a benchmark against a reference library like CLBlast.

mratsim avatar Jul 13 '19 09:07 mratsim

Futhark does not fare particularly well against CLBlast (at least, I'd not expect that it does). This is based on only a little data: for the PPoPP'19 experiments we implemented matrix multiplication in Futhark and with cuBLAS. As I recall, cuBLAS was at least two times faster in its comfort zone.

We pretty much know why this is: Futhark only does block tiling in local memory, while cuBLAS (probably) also does register tiling. We verified this by manually applying register tiling to Futhark's generated code, which brought performance fairly close to cuBLAS. We roughly know how to implement register tiling, but it's a bunch of work, so we haven't gotten around to it yet. It's unlikely (and not really intended) that Futhark will ever outperform expertly hand-tuned code on a primitive-for-primitive level.
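For readers unfamiliar with the distinction, here is a minimal Python sketch (hypothetical names, plain nested lists) of the loop structure involved: the outer three loops do block tiling over tile-sized sub-problems, while the innermost loops accumulate into a local scalar before writing back, which is a rough CPU-side analogue of register tiling. Real GPU code would instead hold a small tile of accumulators in registers per thread:

```python
def matmul_tiled(a, b, n, tile=2):
    """Multiply two n x n matrices; tile must evenly divide n."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):            # block tiling: walk the output
        for jj in range(0, n, tile):        # in tile x tile sub-blocks
            for kk in range(0, n, tile):    # and the reduction dim in tiles
                for i in range(ii, ii + tile):
                    for j in range(jj, jj + tile):
                        # register-tiling analogue: accumulate in a local
                        # scalar instead of repeatedly touching c[i][j]
                        acc = c[i][j]
                        for k in range(kk, kk + tile):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

This is only a conceptual sketch under stated assumptions, not Futhark's generated code; on a GPU the block tile would live in local/shared memory and the per-thread accumulators in registers.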

athas avatar Jul 13 '19 10:07 athas

You can beat hand-tuned code, or at least reach 90% of its speed, with a high-level library:

  • Nvidia CUTLASS shows the high-level primitives necessary to reach handwritten cuBLAS performance.
  • Halide allows us to use a high-level einsum-like DSL for the algorithm with separate loop schedule tuning.
  • Tiramisu similarly allows schedule tuning, but is polyhedral-based.

Even on CPU, I personally managed to reach BLAS speed on matrix multiplication without any assembly:

Obviously that does not compose well, and it would require the same effort to reproduce in OpenCL or CUDA, so I'm looking into alternative representations that abstract loops, hence my inquiries into Futhark.

mratsim avatar Jul 15 '19 08:07 mratsim

Sure, but all of those still involve machine-specific hand-tuning/scheduling (or search procedures). We don't have any current plans for Futhark to directly support that level of manual optimisation, but maybe if we can figure out a way that does not complicate the language as a whole... The Halide/Tiramisu approaches are definitely inspiring.

Currently, our main performance focus is on optimising composition and ensuring good application-level performance. Eventually it'd be fun to also directly compete at the level of individual primitives, but alas, time is finite!

athas avatar Jul 15 '19 09:07 athas