Detect and deal with load imbalance in the GEMM macro-kernel (#437)
Details:
- In some multi-threading schemes, the JR_NT and IR_NT settings may leave some threads idle, performing no computation.
- This commit detects such situations and implements a collapse of the JR/IR loops.
Note: run git show -w <sha1> for better readability, since the new if() introduces an extra indentation level.
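For illustration, here is a minimal sketch of the two dispatch schemes (C, hypothetical names; this is not the actual patch, which works on the real macro-kernel variables and thrinfo_t tree):

```c
#include <stddef.h>

/* Illustrative sketch only; identifiers are hypothetical.
 * Original scheme: the n_iter JR iterations are split over jr_nt groups,
 * and inside each group the m_iter IR iterations are split over ir_nt
 * threads. Idle threads appear whenever n_iter % jr_nt != 0 or
 * m_iter % ir_nt != 0. */
static void two_level( size_t jr_tid, size_t jr_nt,
                       size_t ir_tid, size_t ir_nt,
                       size_t m_iter, size_t n_iter )
{
    for ( size_t j = jr_tid; j < n_iter; j += jr_nt )      /* JR loop */
        for ( size_t i = ir_tid; i < m_iter; i += ir_nt )  /* IR loop */
        { /* micro-kernel on micro-tile (i, j) */ }
}

/* Collapsed scheme: all jr_nt*ir_nt threads share one flat pool of
 * n_iter*m_iter micro-tiles, so idle threads can only appear in the very
 * last round, when n_iter*m_iter is not a multiple of the thread count. */
static void collapsed( size_t tid, size_t n_threads,
                       size_t m_iter, size_t n_iter )
{
    for ( size_t it = tid; it < n_iter * m_iter; it += n_threads )
    {
        size_t j = it / m_iter;   /* JR (NR-panel) index */
        size_t i = it % m_iter;   /* IR (MR-panel) index */
        /* micro-kernel on micro-tile (i, j) */
        (void)i; (void)j;
    }
}
```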
@devinamatthews Curious: did you ever take a look at this PR? What are your thoughts?
@fgvanzee I'll look at the changes in more detail. This was discussed in #437 and only affects cases where BLIS_JR_NT is probably too large.
Yeah, I'm not sure I'm comfortable with this. I really think the better answer is just to not use so many threads in the JR loop. @hominhquan Is there a particular reason to set BLIS_JR_NT so high? @fgvanzee ping
@hominhquan Is there a particular reason to set BLIS_JR_NT so high?
Yes. On our MPPA processor, where the matrix tiles (A, B, C) are copied to scratchpad memory by DMA, memory footprint is a big concern. We have 16 cores in each "cluster", but spawn only one NC-loop instance at the top (i.e. BLIS_JC_NT = 1), then share the computation among the cores further down, in the macro-kernel. The macro-kernel dispatch is therefore subject to BLIS_JR_NT * BLIS_IR_NT = 16.
Spawning two NC flows (BLIS_JC_NT = 2) would loosen that constraint, but would inevitably double the local DMA scratchpad requirement, which we can't afford. (See the illustration: https://user-images.githubusercontent.com/1337056/178685971-7510aca3-ab44-46b2-aed3-43c25985f67b.png)
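For concreteness, this is roughly how that scheme can be expressed through BLIS's global runtime API; bli_thread_set_ways() is the call documented in Multithreading.md, but please double-check the argument order (JC, PC, IC, JR, IR) against your BLIS version:

```c
#include "blis.h"

int main( void )
{
    /* One NC-loop instance, no IC parallelism, and the 16 cluster cores
       split across the JR/IR loops (here JR_NT = 16, IR_NT = 1), so
       BLIS_JR_NT * BLIS_IR_NT = 16 as described above. */
    bli_thread_set_ways( 1, 1, 1, 16, 1 );   /* jc, pc, ic, jr, ir */

    /* ... call bli_gemm() / the BLAS interface as usual ... */

    return 0;
}
```

The BLIS_JC_NT / BLIS_IC_NT / BLIS_JR_NT / BLIS_IR_NT environment variables mentioned in this thread express the same per-loop configuration at run time.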
More generally, I think collapsing the two JR/IR loops in the macro-kernel could help reduce loop overhead and give better load balancing under any threading scheme, and so be more "hardware-friendly". I remember (I may be wrong) seeing some commits or discussions about adding custom hand-optimized macro-kernels to BLIS; if so, I am curious about the motivation behind them.
Note: in this PR I kept the original two nested loops and added a new collapsed variant, which sadly makes the code larger; if this were my personal repo, I would keep only the collapsed version.
You aren't doing any threading along the M dimension (BLIS_IC_NT)?
No, both BLIS_JC_NT and BLIS_IC_NT are set to 1.
Only BLIS_JR_NT and BLIS_IR_NT vary; each can take any value in {1, 2, 4, 8, 16} under the constraint BLIS_JR_NT * BLIS_IR_NT = 16.
Is this also a memory thing? Parallelizing along the IC loop would definitely be preferable.
Alternatively, since you are currently just collapsing the IR/JR loops, why not set IR_NT=4 and JR_NT=4?
Setting IR_NT=4 and JR_NT=4 could improve the situation, but it is not the ultimate solution. Here is an example of how the two-nested-loop dispatch can be mis-balanced on edge blocks:
M = N = 3000, MC = NC = 256, MR = 8, NR = 16. The edge macro-block is (3000 % 256) = 184-by-184, so m_iter = ceil(184/8) = 23 and n_iter = ceil(184/16) = ceil(11.5) = 12.
- If BLIS_JR_NT = 16, BLIS_IR_NT = 1, only 12 threads work in the JR loop, and 4 threads sit in standby.
- If BLIS_JR_NT = 8, BLIS_IR_NT = 2, there is enough work for all 8 JR threads in the first round (8 < n_iter), but in the second round only 4 JR threads are working (n_iter - 8 = 4) while 4 JR threads stand by. The same happens for the IR threads, since m_iter = 23 is not a multiple of BLIS_IR_NT.
- If BLIS_JR_NT = 4, BLIS_IR_NT = 4, the dispatch is perfect in the JR loop (n_iter = 12 is a multiple of BLIS_JR_NT), but not in the IR loop, where m_iter = 23 is not a multiple of BLIS_IR_NT. In the last round of m_iter's (20 + 3), only 3 IR threads work, so 1 idle IR thread x 4 JR threads = 4 threads in standby.
In this case, setting IR_NT=4 and JR_NT=4 can significantly reduce the imbalance (though not remove it completely; some threads still idle in trailing iterations) for the edge blocks of size 256x184 or 184x256 (right and bottom edges), but not for the final 184x184 block in the bottom-right corner, even though that affects only a single block. (A small counting sketch follows below.)
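Here is a small stand-alone counting sketch (illustrative only, not BLIS code) that reproduces the numbers above for the 184x184 corner block:

```c
#include <stdio.h>

/* For one edge block with m_iter x n_iter micro-tiles, count how many of the
   jr_nt*ir_nt hardware threads sit idle during the trailing JR round and
   during the trailing IR round, assuming the round-robin split described
   above. Illustrative counting only. */
static void report( int m_iter, int n_iter, int jr_nt, int ir_nt )
{
    int jr_rem = n_iter % jr_nt;   /* JR groups that still have work in the last JR round */
    int idle_last_jr = jr_rem ? ( jr_nt - jr_rem ) * ir_nt : 0;

    int ir_rem = m_iter % ir_nt;   /* IR threads that still have work in the last IR round */
    int idle_last_ir = ir_rem ? ( ir_nt - ir_rem ) * jr_nt : 0;

    printf( "JR_NT=%2d IR_NT=%2d : %2d threads idle in last JR round, "
            "%2d idle in last IR round\n",
            jr_nt, ir_nt, idle_last_jr, idle_last_ir );
}

int main( void )
{
    /* 184x184 corner block: m_iter = ceil(184/8) = 23, n_iter = ceil(184/16) = 12 */
    report( 23, 12, 16, 1 );
    report( 23, 12,  8, 2 );
    report( 23, 12,  4, 4 );
    return 0;
}
```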
Is this also a memory thing? Parallelizing along the IC loop would definitely be preferable.
Yes, it's still the memory-footprint issue. Setting IC_NT to anything larger than 1 would also increase the DMA scratchpad requirement.
@fgvanzee what might happen if the collapsed version were used all the time?
@devinamatthews For context: The only reason I asked about this PR when I did is because an industry partner asked about it during a conference call, so I wanted to be able to tell them what was going on with it. Now that I've looked at it more closely, I can say that I'm not comfortable with the current implementation. But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it.
@fgvanzee what might happen if the collapsed version were used all the time?
I would need to study the code more before answering this.
But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it.
This was my concern as well, and hence the question.
But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it.
This was my concern as well, and hence the question.
Maybe I was overly cautious in this PR and kept the original implementation. As I said before, if this were my own code I would replace the two nested loops with the collapsed one.
But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it.
@fgvanzee what might happen if the collapsed version were used all the time? I would need to study the code more before answering this.
@devinamatthews @fgvanzee I've looked at the code again and saw that thrinfo_t* thread and thrinfo_t* caucus are already created in the thread-info tree, so my modification, which introduces yet another temporary thrinfo_t, is not optimal either.
Further, I would like to know your thoughts on the relevance of always collapsing the JR/IR loops. Is there any other macro-kernel in which collapsing the JR/IR loops would be bad, difficult, or impossible (1m? trsm? mixed-precision? mixed-domain?)?
- If the answer is yes, we can apply the collapse case by case, wherever the modification is trivial (e.g. real-real gemm).
- If the answer is no, and if you are convinced that collapsing is good for performance, we might apply it everywhere and reconsider the usefulness of "caucus" in the thread-info tree (see the sketch below).
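To make that last point concrete, here is a hedged sketch of what the collapsed macro-kernel loop might look like when driven by a single thread-info node (i.e. with the caucus level dropped and the JR node carrying all jr_nt*ir_nt ways). bli_thread_n_way() and bli_thread_work_id() are the accessor names I believe the thrinfo_t API exposes, but this is an assumption to be checked against the actual macro-kernels, not the PR's current code:

```c
#include "blis.h"

/* Hedged sketch, not the PR's code. Assumes:
 *  - the thrinfo_t node passed in exposes bli_thread_n_way() and
 *    bli_thread_work_id() with their usual meaning, and
 *  - that node carries all jr_nt*ir_nt ways once "caucus" is removed. */
static void collapsed_from_thrinfo( thrinfo_t* thread,
                                    dim_t m_iter, dim_t n_iter )
{
    const dim_t n_way = bli_thread_n_way( thread );   /* jr_nt * ir_nt */
    const dim_t tid   = bli_thread_work_id( thread );

    for ( dim_t it = tid; it < n_iter * m_iter; it += n_way )
    {
        const dim_t j = it / m_iter;   /* JR (NR-panel) index */
        const dim_t i = it % m_iter;   /* IR (MR-panel) index */

        /* ... micro-kernel call for micro-tile (i, j) ... */
        (void)i; (void)j;
    }
}
```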