Proof-of-concept: speeding up gemm reference kernel

Open bartoldeman opened this issue 10 months ago • 4 comments

Related issue: https://github.com/flame/blis/issues/259

This proof of concept is the result of me playing around a bit with reference kernels to better understand the underlying algorithms used for GEMM by BLIS and OpenBLAS for an upcoming talk ( https://easybuild.io/eum25/#linalg ): with blislab I got close to peak performance with just a kernel written in C.

This is a proof of concept because I'm not quite sure how to integrate parts of it, particularly the prefetch stuff, and I may have abused the C preprocessor a bit too much, although other things may be straightforward.

So here's the idea: via a macro, generate 4 fast kernels, one per combination of row-major/column-major and beta==0/beta!=0.
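A minimal sketch of that macro idea, not the code from this PR: the names `GEN_GEMM_UKR`, `AT_ROW`, `AT_COL`, `UPD_BETA0`, `UPD_BETAN` and `gemm_ukr_dispatch` are hypothetical, the 4x4 tile is only illustrative, and the inner loop is a plain for loop to keep the focus on the four expansions.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 }; /* illustrative tile sizes only */

/* Index of C(i,j) for each storage order. */
#define AT_ROW(i, j, ldc) ((i) * (ldc) + (j))
#define AT_COL(i, j, ldc) ((i) + (j) * (ldc))

/* C update for each beta case; x is the already alpha-scaled product. */
#define UPD_BETA0(cij, x, beta) ((void)(beta), (cij) = (x))
#define UPD_BETAN(cij, x, beta) ((cij) = (x) + (beta) * (cij))

/* Hypothetical generator: expands to one specialised microkernel.
   a holds MR values per k step, b holds NR values per k step. */
#define GEN_GEMM_UKR(name, C_AT, C_UPD)                                   \
static void name(size_t k, double alpha, const double *a,                 \
                 const double *b, double beta, double *c, size_t ldc)     \
{                                                                         \
    double ab[MR][NR] = {{0}};                                            \
    for (size_t l = 0; l < k; l++)                                        \
        for (size_t i = 0; i < MR; i++)                                   \
            for (size_t j = 0; j < NR; j++)                               \
                ab[i][j] += a[l * MR + i] * b[l * NR + j];                \
    for (size_t i = 0; i < MR; i++)                                       \
        for (size_t j = 0; j < NR; j++)                                   \
            C_UPD(c[C_AT(i, j, ldc)], alpha * ab[i][j], beta);            \
}

GEN_GEMM_UKR(ukr_row_beta0, AT_ROW, UPD_BETA0)
GEN_GEMM_UKR(ukr_row_betan, AT_ROW, UPD_BETAN)
GEN_GEMM_UKR(ukr_col_beta0, AT_COL, UPD_BETA0)
GEN_GEMM_UKR(ukr_col_betan, AT_COL, UPD_BETAN)

/* Pick the specialised kernel once, outside the hot loop. */
void gemm_ukr_dispatch(int row_major, size_t k, double alpha,
                       const double *a, const double *b,
                       double beta, double *c, size_t ldc)
{
    if (row_major) {
        if (beta == 0.0) ukr_row_beta0(k, alpha, a, b, beta, c, ldc);
        else             ukr_row_betan(k, alpha, a, b, beta, c, ldc);
    } else {
        if (beta == 0.0) ukr_col_beta0(k, alpha, a, b, beta, c, ldc);
        else             ukr_col_betan(k, alpha, a, b, beta, c, ldc);
    }
}
```

Because the storage order and the beta case are fixed at expansion time, each generated function contains no branch on either of them.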

Then the for loop over k was replaced by a do-while loop, so the kernel only works for k>0 (which is checked beforehand). Some 20 iterations before the end, much like various asm kernels, it prefetches the relevant parts of C; I did not see any benefit from prefetching A and B. I also needed to fold the scaling into the C updater, replacing bli_tcopys with bli_tscal2s and bli_txpbys with bli_taxpbys.
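The loop shape described above could look roughly like the following sketch. It is not the PR's code: `ukr_dowhile` is a hypothetical name, the 4x4 tile and the 64-byte cache line are assumptions, only the column-major, beta!=0 case is shown, and `__builtin_prefetch` is the GCC/Clang builtin rather than anything BLIS-specific.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 4, NR = 4, TAIL_NITER = 20, CACHELINE_SIZE = 64 };

/* Column-major C, beta != 0 case: c := alpha * a*b + beta * c. */
void ukr_dowhile(size_t k, double alpha, const double *a, const double *b,
                 double beta, double *c, size_t ldc)
{
    assert(k > 0); /* the do-while body runs at least once */

    double ab[MR][NR] = {{0}};
    size_t left = k;
    /* Prefetch TAIL_NITER iterations before the end, or at once if k is short. */
    const size_t tail = k < TAIL_NITER ? k : TAIL_NITER;

    do {
        if (left == tail) {
            for (size_t j = 0; j < NR; j++) {
                const char *col = (const char *)&c[j * ldc];
                for (size_t off = 0; off < MR * sizeof(double); off += CACHELINE_SIZE)
                    __builtin_prefetch(col + off, 1, 3); /* prefetch for write */
            }
        }
        /* Rank-1 update of the accumulator tile. */
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NR; j++)
                ab[i][j] += a[i] * b[j];
        a += MR;
        b += NR;
    } while (--left > 0);

    /* Scaling by alpha is folded into the C update (axpby form). */
    for (size_t j = 0; j < NR; j++)
        for (size_t i = 0; i < MR; i++)
            c[i + j * ldc] = alpha * ab[i][j] + beta * c[i + j * ldc];
}
```

Whether a compiler keeps `ab` in registers for a loop like this depends on the compiler, flags and tile size, so the generated assembly is worth checking.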

I found that if I use a for loop instead of do-while, or test for beta==0 inside the kernel, the compiler spills the whole C-tile from registers onto the stack, whereas with this approach it keeps the tile in registers.

Some tests on zen4 (single socket AMD EPYC 9534 64-Core Processor, Genoa) with GCC 13.3, CFLAGS="-march=native" ./configure generic on a 2400x2400 single-threaded dgemm:

  • original generic: 36.13 Gflops
  • generic with this PR: 44.96 Gflops
  • using column-major with KC,MC,NC copied from AOCL-BLAS' zen4 config: 57.67 Gflops (*)
  • AOCL-BLAS 5.0, pre-compiled GCC binary: 56.77 Gflops

(it was cool to beat AOCL-BLAS by a small amount, although of course there may be other cases where it won't!)
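For reading the figures above, here is a small sketch of the arithmetic, assuming the usual 2*m*n*k flop count for GEMM (the benchmark driver is not stated here, so this convention is an assumption). The helper names `gemm_gflops` and `speedup` are hypothetical.

```c
#include <stddef.h>

/* Standard GEMM flop count: one multiply and one add per (i, j, l) triple. */
double gemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}

/* Ratio between two Gflops figures measured on the same problem. */
double speedup(double new_gflops, double old_gflops)
{
    return new_gflops / old_gflops;
}
```

With these definitions, a 2400x2400 dgemm is about 27.6 Gflop of work, and `speedup(44.96, 36.13)` comes out at roughly 1.24.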

(*) this used CFLAGS="-march=native -DBLIS_MR_d=32 -DBLIS_NR_d=6" ./configure generic and the following change:

--- a/ref_kernels/bli_cntx_ref.c
+++ b/ref_kernels/bli_cntx_ref.c
@@ -379,8 +379,8 @@ void GENBARNAME(cntx_init)
        bli_blksz_init     ( &blkszs[ BLIS_NR  ],     BLIS_NR_s,     BLIS_NR_d,     BLIS_NR_c,     BLIS_NR_z,
                                                  BLIS_PACKNR_s, BLIS_PACKNR_d, BLIS_PACKNR_c, BLIS_PACKNR_z );
        bli_blksz_init_easy( &blkszs[ BLIS_MC  ],           256,           128,           128,            64 );
-       bli_blksz_init_easy( &blkszs[ BLIS_KC  ],           256,           256,           256,           256 );
-       bli_blksz_init_easy( &blkszs[ BLIS_NC  ],          4096,          4096,          4096,          4096 );
+       bli_blksz_init_easy( &blkszs[ BLIS_KC  ],           256,           512,           256,           256 );
+       bli_blksz_init_easy( &blkszs[ BLIS_NC  ],          4096,          4002,          4096,          4096 );
        bli_blksz_init_easy( &blkszs[ BLIS_M2  ],          1000,          1000,          1000,          1000 );
        bli_blksz_init_easy( &blkszs[ BLIS_N2  ],          1000,          1000,          1000,          1000 );
        bli_blksz_init_easy( &blkszs[ BLIS_AF  ],             8,             8,             8,             8 );
@@ -447,7 +447,7 @@ void GENBARNAME(cntx_init)
        gen_func_init_ro( &funcs[ bli_ker_idx( BLIS_GEMMTRSM1M_U_UKR ) ], gemmtrsm1m_u_ukr_name );

        //                                                           s      d      c      z
-       bli_mbool_init( &mbools[ BLIS_GEMM_UKR_ROW_PREF ],        TRUE,  TRUE,  TRUE,  TRUE );
+       bli_mbool_init( &mbools[ BLIS_GEMM_UKR_ROW_PREF ],        TRUE, FALSE,  TRUE,  TRUE );
        bli_mbool_init( &mbools[ BLIS_GEMMTRSM_L_UKR_ROW_PREF ], FALSE, FALSE, FALSE, FALSE );
        bli_mbool_init( &mbools[ BLIS_GEMMTRSM_U_UKR_ROW_PREF ], FALSE, FALSE, FALSE, FALSE );
        bli_mbool_init( &mbools[ BLIS_TRSM_L_UKR_ROW_PREF ],     FALSE, FALSE, FALSE, FALSE );
@@ -552,4 +552,3 @@ void GENBARNAME(cntx_init)
        for ( dim_t i = 0; i < BLIS_NUM_LEVEL3_OPS; i++ )
                bli_cntx_set_l3_sup_handler( i, vfuncs[ i ], cntx );
 }

bartoldeman, Mar 09 '25 17:03

Thank you @bartoldeman for your contribution. I added some comments in review. Feel free to reply back.

hominhquan, Mar 10 '25 12:03

@bartoldeman I haven't done as thorough a review as @hominhquan but I especially like that you were able to find a way to convince the compiler to keep the AB microtile in registers. This is something I had struggled with a lot and was the biggest deficiency compared to the hand-written kernels. (Except for icc which did some very strange things in the loop body, but it's deprecated now...)

Thinking of integration into BLIS, what would be really neat is if as much as possible about the kernel were configurable via macros, e.g. the number of iterations before the end at which to prefetch C, row-major vs. column-major, etc. Then, if compilers tend to play nice in a portable way (we'd need to look at this), it would be an excellent starting point for new architectures.

devinamatthews, Mar 10 '25 15:03

@devinamatthews I created a zen4 config using this reference kernel, but that means adding it in various places throughout the general source code as well, not simply adding a few files under config/zen4. That said, bli_kernel_defs_<arch>.h could perhaps be a place to define CACHELINE_SIZE and TAIL_NITER. Right now I can already pass MR and NR to configure, but would it be a good idea to allow the same for some other constants too, so that a tuned generic kernel is easy to create without going more "heavy duty"? This is why it's a POC: I'm really not familiar enough with how to do that yet.
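A minimal sketch of what such overridable constants might look like, assuming they are plain preprocessor macros with fallback defaults. The default values and the helper `tile_col_cachelines` are hypothetical, and nothing here reflects how BLIS actually wires up its per-architecture headers.

```c
#include <stddef.h>

/* Hypothetical defaults; a build could override them, e.g. with
   -DTAIL_NITER=16 on the command line or from a per-architecture header. */
#ifndef TAIL_NITER
#define TAIL_NITER 20      /* iterations before the end at which C is prefetched */
#endif

#ifndef CACHELINE_SIZE
#define CACHELINE_SIZE 64  /* bytes; an assumed default, not universal */
#endif

_Static_assert(TAIL_NITER > 0, "TAIL_NITER must be positive");
_Static_assert(CACHELINE_SIZE > 0 &&
               (CACHELINE_SIZE & (CACHELINE_SIZE - 1)) == 0,
               "CACHELINE_SIZE must be a power of two");

/* Cache lines spanned by one column of an mr-row tile of C,
   assuming the column starts on a cache-line boundary. */
size_t tile_col_cachelines(size_t mr, size_t elem_size)
{
    return (mr * elem_size + CACHELINE_SIZE - 1) / CACHELINE_SIZE;
}
```

For example, with the 64-byte default and MR=32 doubles as in the tuned run above, one column of the tile spans four cache lines, which is the number of prefetches the tail of the loop would need per column.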

bartoldeman, Mar 10 '25 16:03

Yes, just those two settings would be a great place to start. We'd just want to give them "BLISier" names and write some documentation. We'd also want to find somebody to test on other architectures (I can only do Zen3, SKX, and Apple M1).

devinamatthews, Mar 10 '25 16:03