Add possibility to reduce pool-block size upon micro-kernel behavior
-
Currently, MC (or NC) size of of pool-block is extended by max(MR, NR) to cover possible invalid prefetch (speculative) of micro-kernels, typically for their convenience in the last KC-iteration.
-
This commit adds a macro BLIS_UKERNELS_NO_SPECULATIVE_PREFETCH to leave a possibility to any architecture which, in exchange of carefull prefetching in their micro-kernels to avoid invalid loads, to save some memory space.
@hominhquan I agree with the idea, but I think it might be simpler, clearer, and more flexible to instead create a macro that encodes how much to extend MC/NC when allocating, e.g. BLIS_GEMM_PANEL_PADDING or something. If not set, it could default to max(MR,NR). Then ukernels could set it to 0 or any other suitable value. @fgvanzee does this value already have a name internally?
Aside: "prefetch" isn't the right word because the prefetch is harmless, it's the speculative load that is the problem.
@devinamatthews thanks for your comment, and okay to a yet-to-be-named BLIS_GEMM_PANEL_PADDING macro which is even more generic.
@hominhquan I agree with the idea, but I think it might be simpler, clearer, and more flexible to instead create a macro that encodes how much to extend MC/NC when allocating, e.g. BLIS_GEMM_PANEL_PADDING or something. If not set, it could default to max(MR,NR). Then ukernels could set it to 0 or any other suitable value. @fgvanzee does this value already have a name internally?
@devinamatthews Are you asking if I have coined terminology for this additional micropanel of space reserved in the packing blocks for A and B? Or are you asking if I've already parameterized this within the build system / cpp macros?
Either, but especially the latter.
Either, but especially the latter.
I guess I don't have a name for it yet. (I never expected there would be any need or desire to isolate it for disabling.)
but I think it might be simpler, clearer, and more flexible to instead create a macro that encodes how much to extend MC/NC when allocating, e.g. BLIS_GEMM_PANEL_PADDING or something. If not set, it could default to max(MR,NR).
In principle, I sympathize with wanting to make this a cpp macro parameter that can be set in a header file in the subconfiguration directory.
The problem I see with this is that the extra micropanel's size depends on the register blocksizes, which are set on a per-datatype basis. So you couldn't nicely capture the absolute size of padding for all datatypes with one number; you'd need four constants--one for each datatype.
One could imagine, however, a single constant--probably a double--that determines the size of padding relative to the size of one micropanel. We would have to establish clear rules how that micropanel size would be defined. For example, currently that micropanel is basically max( packmr, packnr ) * kc * dt in size. So maybe that would be our fundamental unit size.
This would allow @hominhquan to zero out the constant, but allow others to increase it. The only limitations are that (1) they'd have to express the constant in units of micropanels, and (2) the constant would not vary across datatypes (although the number of bytes would naturally vary given that the register blocksizes vary).
As an aside, I see some code that I could probably clean up and simplify. BLIS has not used right-side trsm macrokernels in a long time, and I don't think we will go back to that regime. If I can commit to that, then there is no need for computing the "scaling factors" that facilitate swapping of MR and NR.
4 macros seems reasonable as ukernels for different datatypes may have different behavior.
4 macros seems reasonable as ukernels for different datatypes may have different behavior.
I'm really trying to move BLIS away from situations where we have to explicitly handle different datatypes in the code because it adds yet another place that requires tending-to and maintenance when adding support for new datatypes.