Disable automatic parallelism in the JR loop?
When setting parallelism the "automatic way," as described in the Multithreading documentation, the caller defers to BLIS to factor the total number of threads into the number of ways of parallelism for each loop. For example, if the user sets BLIS_NUM_THREADS to 8, internal logic encoded in bli_rntm_set_ways_from_rntm() (located in frame/base/bli_rntm.c) will attempt to factor these 8 ways of parallelism between the m and n dimensions. It begins by assigning all parallelism to the jc and ic loops, and then offloads some of each to jr and ir, depending on the values of BLIS_DEFAULT_NR_THREAD_MAX and BLIS_DEFAULT_MR_THREAD_MAX, which specify the maximum ways of parallelism for the jr and ir loops, respectively. (Those variables should be renamed, btw.) In this example, assuming the problem size was square, BLIS would end up using ic_nt = 4 and jc_nt = 2.
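For intuition, here is a minimal, self-contained sketch of the first stage of that factorization. The helper name partition_ways and the simple halving heuristic are hypothetical (the real logic in bli_rntm_set_ways_from_rntm() is more involved), the thread count is assumed to be a power of two, and the subsequent offload of jc/ic parallelism to jr/ir (capped by the *_THREAD_MAX macros) is omitted:

```c
#include <stdio.h>

// Hypothetical sketch: split nt ways of parallelism between the ic and
// jc loops according to the problem's m:n aspect ratio. Assumes nt is a
// power of two. This only illustrates the idea; it is not the actual
// implementation in frame/base/bli_rntm.c.
static void partition_ways( int nt, int m, int n, int* ic_nt, int* jc_nt )
{
    *ic_nt = 1; *jc_nt = 1;

    // Give each factor of 2 to whichever dimension currently has the
    // larger per-thread workload.
    while ( nt > 1 )
    {
        if ( m / *ic_nt >= n / *jc_nt ) *ic_nt *= 2;
        else                            *jc_nt *= 2;
        nt /= 2;
    }
}

int main( void )
{
    int ic_nt, jc_nt;

    // A square problem with BLIS_NUM_THREADS=8, as in the example above.
    partition_ways( 8, 4000, 4000, &ic_nt, &jc_nt );
    printf( "ic_nt = %d, jc_nt = %d\n", ic_nt, jc_nt ); // ic_nt = 4, jc_nt = 2

    return 0;
}
```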
This issue concerns the default values for BLIS_DEFAULT_NR_THREAD_MAX and BLIS_DEFAULT_MR_THREAD_MAX, as specified in frame/include/bli_kernel_macro_defs.h:
```c
#ifndef BLIS_DEFAULT_MR_THREAD_MAX
#define BLIS_DEFAULT_MR_THREAD_MAX 1
#endif

#ifndef BLIS_DEFAULT_NR_THREAD_MAX
#define BLIS_DEFAULT_NR_THREAD_MAX 4
#endif
```
The default value of 1 for BLIS_DEFAULT_MR_THREAD_MAX is sensible: the ir loop is seldom parallelized, since doing so only pays off when the L1 data cache is shared among some group of cores.
However, I think BLIS_DEFAULT_NR_THREAD_MAX should also be set to 1. The reason is as follows: generally speaking, extracting parallelism from the jr loop is prescribed only when the L2 cache is shared among some group of cores. The most prevalent microarchitectures in use today (Haswell, Broadwell, Skylake, and Kaby Lake) all allocate a private L2 cache to each core. And while there are examples of other systems where the L2 cache is shared, those cases are not widespread, and they can always be handled by overriding the defaults in the corresponding bli_family_<arch>.h file within the configuration.
So, my proposed change is simple:
```c
#ifndef BLIS_DEFAULT_MR_THREAD_MAX
#define BLIS_DEFAULT_MR_THREAD_MAX 1
#endif

#ifndef BLIS_DEFAULT_NR_THREAD_MAX
#define BLIS_DEFAULT_NR_THREAD_MAX 1
#endif
```
Thus, unless overridden by the configuration, this would force all automatically obtained parallelism to the jc and ic loops.
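For a configuration whose cores do share an L2 cache, the override would be a one-line definition in the family header; because the defaults in bli_kernel_macro_defs.h are wrapped in #ifndef guards, the family header's definition takes precedence. (The value 4 below is illustrative, not a recommendation for any particular system.)

```c
// In config/<arch>/bli_family_<arch>.h: re-enable jr-loop parallelism
// for a system whose cores share an L2 cache. The value 4 is only an
// example.
#define BLIS_DEFAULT_NR_THREAD_MAX 4
```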
It still makes sense to have JR_NT > 1 with private L2 caches in some (many?) cases. The increase in cache coherency traffic can be offset by the savings of sharing B.
That said, early experiments with setting up the algorithm to prefer m >= n instead of matching the microkernel preference (i.e. "ambidextrous" microkernels) suggest that JR_NT > 1 may be unnecessary in this case.
> The increase in cache coherency traffic can be offset by the savings of sharing B.
Can you elaborate on the savings you're referring to here? The regime I envision would still involve ic_nt threads sharing each panel of B.
You always get IC_NT independent panels of B...right?
Er, I mean 'JC_NT'. But the amount of sharing is less.
Let's assume ic_nt = 4 and jc_nt = 2. This results in two packed panels of B being created. Each panel of B would be shared across 4 threads.
Right, instead of one panel shared across 8 threads if we had jr_nt = 2.
I guess what I don't quite follow is this perceived benefit of threads "sharing" B.
I am badly explaining what was told to me by @tlrmchlsmth some time ago. But the benchmark is the ultimate authority.
FWIW, @dnparikh already has performance data for trsm on ThunderX2 (which has a private L2 cache) that shows a big difference between pushing all parallelism to the jr loop vs. splitting it between the jc and jr loops. (BLIS does not yet support ic parallelism in trsm, so we haven't been able to look at fiddling with ic_nt yet.)
I'll see whether I can detect any difference on my four-core Haswell.
Yes, I imagine relying on cache coherency performance on ARM is a problem.
> Yes, I imagine relying on cache coherency performance on ARM is a problem.
Can you elaborate?
Intel has very fast communication between private caches through the ring bus (a mesh interconnect on newer chips). I highly doubt ARM can hold a candle to it.
I couldn't draw any meaningful inferences from the data I collected on my Haswell workstation. (Not enough cores to play with.) I'm inclined to punt on this issue for now.
@dnparikh I think we can achieve what we discussed for trsm (splitting parallelism between the jc and jr loops) regardless of whether this issue moves forward. Instead, we can build that logic directly into bli_rntm_set_ways_for_op(). A minimal sketch of the kind of logic I have in mind appears below.
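The helper name split_jc_into_jr and the even-split heuristic here are assumptions for illustration; the real change would be integrated into bli_rntm_set_ways_for_op() in frame/base/bli_rntm.c rather than living in a standalone function:

```c
#include "blis.h" // for dim_t

// Hypothetical helper: after automatic factorization assigns all n-dim
// parallelism to jc, move factors of two from the jc loop to the jr
// loop until the two are roughly balanced. The even-split policy is an
// assumption, not the actual BLIS behavior.
static void split_jc_into_jr( dim_t* jc_nt, dim_t* jr_nt )
{
    *jr_nt = 1;
    while ( *jc_nt % 2 == 0 && *jc_nt > *jr_nt )
    {
        *jc_nt /= 2;
        *jr_nt *= 2;
    }
}
```

For example, starting from jc_nt = 8 this yields jc_nt = 2 and jr_nt = 4, and starting from jc_nt = 4 it yields jc_nt = jr_nt = 2.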
Also note that these settings are (or are supposed to be) per-configuration.
I understand. This issue was always about changing the default values to ones that serve as the best starting point.