Imbalanced work between JR/IR threads in edge-macro-blocks
Let's call a *macro-block* an MC-by-NC block, which is traditionally processed by one call to the macro-kernel.
Let's call an *edge-macro-block* either an ME-by-NC or an MC-by-NE block, where 0 < ME < MC and 0 < NE < NC (E stands for edge). In short, edge-macro-blocks occur when a dimension of the C matrix (M or N) is not a multiple of MC (or NC).
I observed this imbalance on edge-macro-blocks in my old version of BLIS. After checking with the latest version of BLIS, even with the slab or round-robin dispatch, I think the issue still remains when all of the following conditions are met:
- An edge-macro-block occurs, so that `n_iter` (or `m_iter`) in the macro-kernel will be smaller than the normal `NC/NR` (or `MC/MR`).
- The user set `BLIS_JR_NT > 1`, for example `BLIS_JR_NT = NC/NR` (one `n_iter` per JR thread) or `BLIS_JR_NT = NC/(2*NR)` (two `n_iter` per JR thread). Likewise for `BLIS_IR_NT > 1`, for example `BLIS_IR_NT = MC/MR` (one `m_iter` per IR thread), and so on.
- The user tuned all `BLIS_*_NT` so that their product equals the number of physical cores.
If all these conditions hold, we can end up with, for example, an edge-macro-block whose n_iter is smaller than the user-defined BLIS_JR_NT. Since all those threads are still spawned, the "trailing" JR threads will sit idle while the few other JR threads work on the edge n_iter blocks. The situation is the same for the IR loop. The slab/rr dispatch cannot help here, since the work is dispatched separately within the JR loop and the IR loop. As a result, up to half of the cores (or even more) may do no computation at all.
One solution I can see is to detect this edge case and dynamically fuse the two loops into a flattened 1D workspace of c11 micro-blocks, then dispatch them equally over jr_nt * ir_nt threads.
For instance, I did this in the gemm macro-kernel of my BLIS version 0.2.2 and it fixes the imbalance:
```c
/* fused num_threads in JR and IR */
dim_t num_threads = jr_num_threads * ir_num_threads;

/* fused number of c11 micro-blocks */
dim_t num_blocks            = m_iter * n_iter;
dim_t num_blocks_per_thread = num_blocks / num_threads;

/* possible trailing c11 blocks */
dim_t num_blocks_trailing   = num_blocks % num_threads;

/* my thread id in the fused thread domain */
dim_t ithread_ji = ir_thread_id * jr_num_threads + jr_thread_id;

/* index of my first c11 block in a flattened 1D range of blocks */
dim_t iblock_begin = ithread_ji * num_blocks_per_thread;

/* nudge my first block in case of num_blocks_trailing */
iblock_begin += bli_min( ithread_ji, num_blocks_trailing );

/* index of my last c11 block in a flattened 1D range of blocks */
dim_t iblock_end = iblock_begin + num_blocks_per_thread;

/* nudge my last block in case of num_blocks_trailing */
iblock_end += ( ( ithread_ji < num_blocks_trailing ) ? 1 : 0 );

/* next panels */
dim_t jnext = iblock_begin % n_iter;
dim_t inext = iblock_begin / n_iter;
ctype* restrict a2 = a_cast + inext * rstep_a;
ctype* restrict b2 = b_cast + jnext * cstep_b;

/* flattened 1D loop over the fused JR/IR loops */
for ( dim_t iblock = iblock_begin; iblock < iblock_end; iblock++ )
{
    /* copy next to current and compute next panels */
    j  = jnext;
    i  = inext;
    a1 = a2;
    b1 = b2;

    /* next panels */
    jnext = ( iblock + 1 ) % n_iter;
    inext = ( iblock + 1 ) / n_iter;
    a2 = a_cast + inext * rstep_a;
    b2 = b_cast + jnext * cstep_b;

    ...
}
```
@hominhquan thanks for your analysis. In practice, BLIS_IR_NT is always 1, as threading this loop just doesn't make sense on any architecture we've seen. Without diving into the details on this, can you comment whether an issue still exists if we assume BLIS_IR_NT == 1?
I should also add that realistically BLIS_JR_NT <= 4.
> @hominhquan thanks for your analysis. In practice, `BLIS_IR_NT` is always 1, as threading this loop just doesn't make sense on any architecture we've seen. Without diving into the details on this, can you comment whether an issue still exists if we assume `BLIS_IR_NT == 1`?
@devinamatthews Yes, I did indeed observe the issue on the JR loop, with JR_NT bigger than the edge n_iter and IR_NT = 1. The IR loop was just a generalization of the JR loop case.
OK. Do you have any performance numbers? It sounds like we can improve general parallel performance then.
Here is an example of the imbalance: MC = NC = 256, MR = 8, NR = 16, M = N = 3000.
The possible edge-macro-block is (3000 % 256 =) 184-by-184, so m_iter = 23 and n_iter = 12 (ceil(11.5)).
If I set BLIS_JR_NT = 16, only 12 threads will work in the JR loop, and 4 threads will stand idle.
> OK. Do you have any performance numbers? It sounds like we can improve general parallel performance then.
For example, on the Kalray MPPA3 processor I get a 1.3× speedup on the macro-kernel itself. The performance gain depends closely on the matrix size and the MC/NC parameters. It is hard to quote a single number for an arbitrary architecture X, but I believe it can improve the overall parallel scheme. That's why I opened this ticket.
In fact, we can reuse the current slab/rr dispatch, but over the fused workspace:
```c
/* construct a fused JR/IR thread_info */
thrinfo_t thread_jrir = ... ;

bli_thread_range_jrir( &thread_jrir, n_iter * m_iter, 1, FALSE,
                       &jrir_start, &jrir_end, &jrir_inc );

/* fused loop */
for ( ... )
{
    ...
}
```
@hominhquan is the general issue that BLIS_JR_NT should be a divisor of BLIS_NC/BLIS_NR? As I mentioned before, BLIS_JR_NT also shouldn't be very large, maybe 4 at most.
@devinamatthews Yes, but the full condition should be:
`BLIS_JR_NT` should ideally be a divisor of `BLIS_NC/BLIS_NR`, and `BLIS_IR_NT` should ideally be a divisor of `BLIS_MC/BLIS_MR`.
> BLIS_JR_NT also shouldn't be very large, maybe 4 at most.
Unless BLIS implicitly re-tweaks the user's BLIS_JR_NT and BLIS_IR_NT to respect the condition you mentioned, a user is free to set these values at their convenience, whether through a global env var or a local rntm_t, no?
And note that this issue does not manifest in all cases. I repeat the example here:
MC = NC = 256, MR = 8, NR = 16, M = N = 3000.
The possible edge-macro-block is (3000 % 256 =) 184-by-184, so m_iter = 23 and n_iter = 12 (ceil(11.5)).
If BLIS_JR_NT = 16, only 12 threads will work in the JR loop, and 4 threads will stand idle.
The same for the IR-loop.
No, BLIS can't override the user's choices, but I guess it's then up to the user not to do that. I just added some notes to the Multithreading documentation.
I'm a bit confused as to what this issue is really about. And to the extent that I understand it, I'm not quite sure why this issue was labeled as a bug.
I'm not sure if this is causing any confusion, but let's all remember that matrix dimensions are first partitioned for parallelism (oblivious to cache blocksizes), and then each thread (or thread group) applies its cache blocksizes to whatever portion of the partitioned dimension it is assigned. (And if there is not enough data along a dimension so that all threads get at least one micropanel of work, then some threads will idle.)
@fgvanzee You are right: functionally speaking, this is not a bug, but a sub-optimality in micro-kernel dispatch. As you said, it is possible to have idle threads not doing any computation.
@hominhquan I just had an interesting conversation where some cases were pointed out in which BLIS_IR_NT > 1 makes sense performance-wise. Does BLIS_IR_NT = BLIS_JR_NT = 4 give you reasonable performance and no idle threads?