GEMMSUP Handler Referring GEMM Kernel Size?
Yet another issue. Excuse me. 🤣
Just happened to notice that (obj_t).blkszs sometimes affects kernel selection of gemmsup which is, I guess, supposed to be only dominated by (obj_t).l3_sup_blkszs?
These lines seems to be a plausible cause: https://github.com/flame/blis/tree/7c3eb44/frame/3/bli_l3_sup_int.c#L95-L96 https://github.com/flame/blis/tree/7c3eb44/frame/3/bli_l3_sup_int.c#L269-L270
It's usually not much impact upon performance even when we use gemmsup kernel sizes different from gemm, but it could be a bit confusing when doing some debug works.
Currently MR,NR are same between gemmsup or gemm for a given data type.
Yes! It is better to change to l3_sup api mentioned below.
const dim_t NR = bli_cntx_get_l3_sup_blksz_def_dt( dt, BLIS_NR, cntx ); const dim_t MR = bli_cntx_get_l3_sup_blksz_def_dt( dt, BLIS_MR, cntx );
At the same time it is required to create new tables and set/get functions for gemmsup, trsmsup, gemmtsup and similarly for other l3 api's so that all are independent.
Seem already implemented: https://github.com/flame/blis/tree/fab5c86/frame/base/bli_cntx.h#L336
and used: https://github.com/flame/blis/tree/fab5c86/frame/3/bli_l3_sup_var12.c#L235-L239 https://github.com/flame/blis/tree/fab5c86/frame/3/bli_l3_sup_var1n2m.c#L292-L296 https://github.com/flame/blis/tree/fab5c86/frame/3/old/bli_l3_sup_var1n2m.c#L276-L280
Maybe it would be just OK to replace the lines in bli_l3_sup_int.c with the code you presented?
Yes! if you consider just your gemm case.
But when we have sup for gemm, trsm , gemmt then it is difficult to use same l3_sup for all api's if they need different sizes, where we need a separate api's bli_cntx_get_l3_supgemm_blksz_def_dt() , bli_cntx_get_l3_suptrsm_blksz_def_dt() to be independent of each other.
We can also solve this having family id (gemm/trsm/gemmt ...etc) as an argument to the bli_cntx_get_l3_sup_blksz_def_dt to differentiate.
It would be very nice to allow using different sizes for GEMM, GEMMT and TRMM, while currently the (non-sup) block kernel handler seems to require all kernel sizes to be equal? (There is only one bli_cntx_get_blksz_def_dt.)
Of course I would also imagine SUP cases to be much more complicated thus requiring a different way to handle...