Disable automatic parallelism in the JR loop?
When setting parallelism the "automatic way," as described in the Multithreading documentation, the caller defers to BLIS to factor the total number of threads into the number of ways of parallelism for each loop. For example, if the user sets BLIS_NUM_THREADS to 8, internal logic encoded in bli_rntm_set_ways_from_rntm() (located in frame/base/bli_rntm.c) will attempt to factor these 8 ways of parallelism between the m and n dimensions. It begins by assigning all parallelism to the jc and ic loops, and then offloads some of each to jr and ir, depending on the values of BLIS_DEFAULT_NR_THREAD_MAX and BLIS_DEFAULT_MR_THREAD_MAX, which specify the maximum ways of parallelism for the jr and ir loops, respectively. (Those variables should be renamed, btw.) In this example, assuming the problem size was square, BLIS would end up using ic_nt = 4 and jc_nt = 2.
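For intuition, here is a minimal, self-contained sketch of the first stage of that factorization. The helper name partition_ways and the simple halving heuristic are hypothetical (the real logic in bli_rntm_set_ways_from_rntm() is more involved), the thread count is assumed to be a power of two, and the subsequent offload of jc/ic parallelism to jr/ir (capped by the *_THREAD_MAX macros) is omitted:

```c
#include <stdio.h>

// Hypothetical sketch: split nt ways of parallelism between the ic and
// jc loops according to the problem's m:n aspect ratio. Assumes nt is a
// power of two. This only illustrates the idea; it is not the actual
// implementation in frame/base/bli_rntm.c.
static void partition_ways( int nt, int m, int n, int* ic_nt, int* jc_nt )
{
    *ic_nt = 1; *jc_nt = 1;

    // Give each factor of 2 to whichever dimension currently has the
    // larger per-thread workload.
    while ( nt > 1 )
    {
        if ( m / *ic_nt >= n / *jc_nt ) *ic_nt *= 2;
        else                            *jc_nt *= 2;
        nt /= 2;
    }
}

int main( void )
{
    int ic_nt, jc_nt;

    // A square problem with BLIS_NUM_THREADS=8, as in the example above.
    partition_ways( 8, 4000, 4000, &ic_nt, &jc_nt );
    printf( "ic_nt = %d, jc_nt = %d\n", ic_nt, jc_nt ); // ic_nt = 4, jc_nt = 2

    return 0;
}
```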
This issue concerns the default values for BLIS_DEFAULT_NR_THREAD_MAX and BLIS_DEFAULT_MR_THREAD_MAX, as specified in frame/include/bli_kernel_macro_defs.h:
```c
#ifndef BLIS_DEFAULT_MR_THREAD_MAX
#define BLIS_DEFAULT_MR_THREAD_MAX 1
#endif

#ifndef BLIS_DEFAULT_NR_THREAD_MAX
#define BLIS_DEFAULT_NR_THREAD_MAX 4
#endif
```
The default value of 1 for BLIS_DEFAULT_MR_THREAD_MAX is sensible: the ir loop is seldom parallelized, since doing so only pays off when the L1 data cache is shared among some group of cores.
However, I think BLIS_DEFAULT_NR_THREAD_MAX should also be set to 1. The reason is as follows: generally speaking, extracting parallelism from the jr loop is prescribed only when the L2 cache is shared among some group of cores. The most prevalent microarchitectures in use today (Haswell, Broadwell, Skylake, and Kaby Lake) all allocate a private L2 cache to each core. And while there are examples of other systems where the L2 cache is shared, those cases are not widespread, and they can always be handled by overriding the defaults in the corresponding bli_family_<arch>.h file within the configuration.
So, my proposed change is simple:
```c
#ifndef BLIS_DEFAULT_MR_THREAD_MAX
#define BLIS_DEFAULT_MR_THREAD_MAX 1
#endif

#ifndef BLIS_DEFAULT_NR_THREAD_MAX
#define BLIS_DEFAULT_NR_THREAD_MAX 1
#endif
```
Thus, unless overridden by the configuration, this would force all automatically obtained parallelism to the jc and ic loops.
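For a configuration whose cores do share an L2 cache, the override would be a one-line definition in the family header; because the defaults in bli_kernel_macro_defs.h are wrapped in #ifndef guards, the family header's definition takes precedence. (The value 4 below is illustrative, not a recommendation for any particular system.)

```c
// In config/<arch>/bli_family_<arch>.h: re-enable jr-loop parallelism
// for a system whose cores share an L2 cache. The value 4 is only an
// example.
#define BLIS_DEFAULT_NR_THREAD_MAX 4
```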
It still makes sense to have JR_NT > 1 with private L2 caches in some (many?) cases. The increase in cache coherency traffic can be offset by the savings of sharing B.
That said, early experiments with setting up the algorithm to prefer m >= n instead of matching the microkernel preference (i.e. "ambidextrous" microkernels) suggest that JR_NT > 1 may be unnecessary in this case.
> The increase in cache coherency traffic can be offset by the savings of sharing B.
Can you elaborate on the savings you're referring to here? The regime I envision would still involve ic_nt threads sharing each panel of B.
You always get IC_NT independent panels of B...right?
Er, I mean 'JC_NT'. But the amount of sharing is less.
Let's assume ic_nt = 4 and jc_nt = 2. This results in two packed panels of B being created. Each panel of B would be shared across 4 threads.
Right, instead of one panel shared across 8 threads if we had jr_nt = 2.
I guess what I don't quite follow is this perceived benefit of threads "sharing" B.
I am badly explaining what was told to me by @tlrmchlsmth some time ago. But the benchmark is the ultimate authority.
FWIW, @dnparikh already has performance data for trsm on ThunderX2 (which has a private L2 cache) that shows a big difference between pushing all parallelism to the jr loop vs. splitting it between the jc and jr loops. (BLIS does not yet support ic parallelism in trsm, so we haven't been able to look at fiddling with ic_nt yet.)
I'll see whether I can detect any difference on my four-core Haswell.
Yes, I imagine relying on cache coherency performance on ARM is a problem.
> Yes, I imagine relying on cache coherency performance on ARM is a problem.
Can you elaborate?
Intel has very fast communication between private caches through the ring bus (a mesh interconnect on newer chips). I highly doubt ARM can hold a candle to it.
I couldn't draw any meaningful inferences from the data I collected on my Haswell workstation. (Not enough cores to play with.) I'm inclined to punt on this issue for now.
@dnparikh I think we can achieve what we discussed for trsm (splitting parallelism between the jc and jr loops) regardless of whether this issue moves forward. Instead, we can build that logic directly into bli_rntm_set_ways_for_op(). A minimal sketch of the kind of logic I have in mind appears below.
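The helper name split_jc_into_jr and the even-split heuristic here are assumptions for illustration; the real change would be integrated into bli_rntm_set_ways_for_op() in frame/base/bli_rntm.c rather than living in a standalone function:

```c
#include "blis.h" // for dim_t

// Hypothetical helper: after automatic factorization assigns all n-dim
// parallelism to jc, move factors of two from the jc loop to the jr
// loop until the two are roughly balanced. The even-split policy is an
// assumption, not the actual BLIS behavior.
static void split_jc_into_jr( dim_t* jc_nt, dim_t* jr_nt )
{
    *jr_nt = 1;
    while ( *jc_nt % 2 == 0 && *jc_nt > *jr_nt )
    {
        *jc_nt /= 2;
        *jr_nt *= 2;
    }
}
```

For example, starting from jc_nt = 8 this yields jc_nt = 2 and jr_nt = 4, and starting from jc_nt = 4 it yields jc_nt = jr_nt = 2.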
Also note that these settings are (or are supposed to be) per-configuration.
I understand. This issue was always about changing the default values to ones that serve as the best starting point.