Use optimal kernel parameters (architectures, matrix layouts)
I am trying to figure out what to use as optimal kernel parameters for different architectures.
For example, it looks like BLIS uses 8x4 for Sandy Bridge, but 8x6 for Haswell. Why? What led them to this setup? Specifically, since operations are usually on 4 doubles at a time, how does the 6 fit in? Is Haswell able to execute a `__m256d` and a `__m128d` operation at the same time?
Furthermore, if we have non-square kernels like for dgemm, is there a scenario where choosing 4x8 over 8x4 is better?
This also concerns src/archparam.rs, where our current parameters are defined.
This discussion is revealing in terms of how to determine optimal kernel parameters: https://github.com/flame/blis/issues/253
In particular, this states:
@VirtualEarth Turn your attention to Eq. 1:
mr * nr >= Nvec * Lvfma * Nvfma

On Intel Haswell/Broadwell/Skylake/Kabylake, Nvec is 4, Lvfma is 5, and Nvfma is 2. Thus, the register blocksize product mr * nr must be at least 40 in order to overcome the floating-point latency. 6 x 8 = 48 more than satisfies this, but 8 x 6 would also be fine. The former is biased toward row-oriented SIMD output while the latter is better for column-oriented SIMD output. In BLIS, it almost never actually matters which one you pick because high-level logic will transpose the entire operation and make the output matrix C appear to be row- or column-stored, depending on the SIMD output "preference" of the microkernel.
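As a sanity check on Eq. 1, here is a small sketch using the constants from the quote above (the function name is mine, for illustration only):

```rust
/// Eq. 1 from the BLIS discussion: the register block mr x nr must hold at
/// least Nvec * Lvfma * Nvfma elements to hide FMA latency.
fn min_mr_nr(n_vec: u32, l_vfma: u32, n_vfma: u32) -> u32 {
    n_vec * l_vfma * n_vfma
}

fn main() {
    // Haswell/Broadwell/Skylake with f64: 4 lanes per vector,
    // 5-cycle FMA latency, 2 FMAs issued per cycle.
    let need = min_mr_nr(4, 5, 2);
    assert_eq!(need, 40);
    assert!(6 * 8 >= need); // 48 accumulators keep enough FMAs in flight
    println!("minimum mr * nr = {need}");
}
```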
Another interesting bit is the choice of 8x6 over 6x8 (or 8x4 over 4x8 for Sandy Bridge, which is our current implementation), which prefers column- vs row-storage in the C matrix. This then ties in with my question here: https://github.com/bluss/matrixmultiply/issues/31
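To illustrate why the 6x8 vs 8x6 choice interacts with C's storage order, here is a hedged scalar sketch (not BLIS's or our actual code): with row-major C, a row-oriented accumulator block writes each row as one contiguous run, i.e. unit-stride vector stores in a real kernel, while the transposed shape would need strided stores or an in-register transpose.

```rust
/// Illustrative row-oriented C update for a 6x8 accumulator block and
/// row-major C with row stride `ldc`: each of the 6 accumulator rows lands
/// in 8 contiguous f64s, so a real kernel can use unit-stride vector stores.
fn store_row_oriented(c: &mut [f64], ldc: usize, acc: &[[f64; 8]; 6]) {
    for i in 0..6 {
        c[i * ldc..i * ldc + 8].copy_from_slice(&acc[i]);
    }
}

fn main() {
    let mut c = vec![0.0f64; 6 * 8];
    let acc = [[1.0f64; 8]; 6];
    store_row_oriented(&mut c, 8, &acc);
    assert!(c.iter().all(|&x| x == 1.0));
}
```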
Hello there,
I'm the author of Arraymancer, a NumPy + machine learning + deep learning library written from scratch in Nim.
In the past month I've been building a new, faster backend with a specific focus on matrix multiplication, as MKL, BLAS, and BLIS were limiting my optimisation opportunities (like fusing neural network activations into GEMM or merging the im2col pass for convolution into the matrix multiplication).
I've done an extensive review of the literature and added a lot of comments in my tiling code.
The most useful papers are:
[1] Anatomy of High-Performance Matrix Multiplication (Revised) Kazushige Goto, Robert A. Van de Geijn - http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
[2] Anatomy of High-Performance Many-Threaded Matrix Multiplication Smith et al - http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
[3] Automating the Last-Mile for High Performance Dense Linear Algebra Veras et al - https://arxiv.org/pdf/1611.08035.pdf
[4] GEMM: From Pure C to SSE Optimized Micro Kernels Michael Lehn - http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
I also keep some extra links that I didn't have time to sort.
Anyway, in terms of performance I have a generic multi-threaded BLAS (float32, float64, int32, int64) that reaches between 97% and 102% of OpenBLAS on my Broadwell (AVX2 and FMA) laptop, depending on whether it runs multithreaded or serial and on float32 vs float64:
Kernel parameters
- mr * nr: 6 * 16.
- Why 16?: One dimension must be a multiple of the vector size, which is 8 with AVX (for float32). Also, since the CPU can issue 2 fused multiply-adds in parallel (instruction-level parallelism), 16 is an easy choice.
- Why 6?: We have 16 SIMD registers on SSE2 through AVX2, and we also need to keep pointers to packed A and packed B plus the loop index over kc. With 6 rows I can unroll twice, using 12 registers for the accumulators, and have 4 left for bookkeeping.
- Why not 16x6?: C will be an MxN matrix, and very often C is a freshly allocated contiguous matrix. Since I use row-major by default, 6x16 avoids transposing during the C update step and allows me to specialize for the case where C has unit stride along the N dimension.
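The register arithmetic behind 6x16 can be checked with a toy calculation (my sketch, assuming 256-bit AVX registers holding 8 float32 lanes and the 16 ymm registers of x86-64):

```rust
/// Toy register accounting: how many SIMD registers an mr x nr accumulator
/// block occupies, given the number of lanes per register.
fn accumulator_registers(mr: usize, nr: usize, simd_lanes: usize) -> usize {
    mr * (nr / simd_lanes)
}

fn main() {
    // 6x16 with 8 f32 lanes per 256-bit AVX register:
    let acc = accumulator_registers(6, 16, 8);
    assert_eq!(acc, 12);     // 12 of the 16 ymm registers hold the C block
    assert_eq!(16 - acc, 4); // 4 left over for loads/broadcasts of A and B
    println!("accumulator registers: {acc}");
}
```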
Panel parameters
Proper tuning of mc and kc is very important as well. Currently I use an arbitrary 768 bytes for mc and 2048 bytes for kc. In my testing I lost up to 35% performance with other values.
There are various constraints on both; the Goto paper goes quite in-depth into them:
- The micropanel of packed B, of size kc * nr, should remain in the L1 cache, and so should take less than half of it to avoid being evicted by other panels.
- The panel of packed A, of size kc * mc, should take a considerable part of the L2 cache while remaining addressable by the TLB. I had an algorithm to choose these, but since I can currently only query cache information (not TLB), I removed it and decided to tune manually.
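The two cache constraints above can be turned into a rough sizing heuristic. The following is a sketch under assumed cache sizes and "half the cache" budgets, not Arraymancer's or matrixmultiply's actual algorithm (and it ignores the TLB constraint entirely):

```rust
/// Largest kc such that the packed-B micropanel (kc * nr elements)
/// occupies at most half of L1.
fn pick_kc(l1_bytes: usize, nr: usize, elem_size: usize) -> usize {
    (l1_bytes / 2) / (nr * elem_size)
}

/// Largest mc such that packed A (mc * kc elements) uses at most half of L2.
/// The real constraint also involves TLB reach, ignored here.
fn pick_mc(l2_bytes: usize, kc: usize, elem_size: usize) -> usize {
    (l2_bytes / 2) / (kc * elem_size)
}

fn main() {
    // Assumed: 32 KiB L1d, 256 KiB L2, f64 elements (8 bytes), nr = 8.
    let kc = pick_kc(32 * 1024, 8, 8);
    let mc = pick_mc(256 * 1024, kc, 8);
    println!("kc = {kc}, mc = {mc}");
}
```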
I forgot to add: as mentioned in the "Automating the Last-Mile" paper, you need to choose your SIMD strategy for the C updates.
You can go for shuffle/permute or broadcast, and balance them to varying degrees.
To find the best mix, you need to check which ports those instructions can be issued from. Also interleave them with FMAs to hide data-fetching latencies.
In my code I use full broadcast and no shuffle/permute, mainly because it was simpler to reason about; I didn't test other configurations.
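A scalar simulation of the full-broadcast scheme may make this concrete (illustrative only; a real kernel would use intrinsics like `_mm256_broadcast_sd` and `_mm256_fmadd_pd`, and a much larger mr x nr):

```rust
// Packed A holds mr elements per k step (a column of A); packed B holds
// nr elements per k step (a row of B). Each A element is "broadcast" across
// all lanes and FMA'd with B's row into the accumulator block.
const MR: usize = 2; // tiny sizes for illustration
const NR: usize = 4;

fn microkernel(kc: usize, a: &[f64], b: &[f64], c: &mut [[f64; NR]; MR]) {
    for k in 0..kc {
        let b_row = &b[k * NR..(k + 1) * NR]; // one "SIMD load" of B
        for i in 0..MR {
            let a_bcast = a[k * MR + i]; // broadcast one A element to all lanes
            for j in 0..NR {
                c[i][j] += a_bcast * b_row[j]; // the FMA
            }
        }
    }
}

fn main() {
    let mut c = [[0.0f64; NR]; MR];
    // kc = 1: A column [1, 2], B row [10, 20, 30, 40]
    microkernel(1, &[1.0, 2.0], &[10.0, 20.0, 30.0, 40.0], &mut c);
    assert_eq!(c[0], [10.0, 20.0, 30.0, 40.0]);
    assert_eq!(c[1], [20.0, 40.0, 60.0, 80.0]);
}
```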
@mratsim Wow, cool to hear from you! Thanks for the links to the papers and for sharing your knowledge!
Issue #59 allows tweaking the NC, MC, KC variables easily at compile-time which is one small step and a model for further compile time tweakability.
Another idea, which libxsmm uses, is an autotuner such as https://opentuner.org/. OpenTuner automatically finds the best parameters for the architecture the code is compiled on.