Disable automatic parallelism in the JR loop?

Open fgvanzee opened this issue 7 years ago • 15 comments

When setting parallelism the "automatic way" as described in the Multithreading documentation, the caller defers to BLIS to factorize the total number of threads into the number of ways of parallelism for each loop. For example, if the user sets BLIS_NUM_THREADS to 8, internal logic encoded in bli_rntm_set_ways_from_rntm() (located in frame/base/bli_rntm.c) will attempt to factorize these 8 ways of parallelism between the m and n dimensions. It begins by assigning all parallelism to the jc and ic loops, and then offloads some of each to jr and ir, depending on the values of BLIS_DEFAULT_NR_THREAD_MAX and BLIS_DEFAULT_MR_THREAD_MAX, which specify the maximum ways of parallelism for the jr and ir loops, respectively. (Those variables should be renamed, btw.) In this example, assuming a square problem, BLIS would end up using ic_nt = 4 and jc_nt = 2.

This issue concerns the default values for BLIS_DEFAULT_NR_THREAD_MAX and BLIS_DEFAULT_MR_THREAD_MAX, as specified in frame/include/bli_kernel_macro_defs.h:

#ifndef BLIS_DEFAULT_MR_THREAD_MAX
#define BLIS_DEFAULT_MR_THREAD_MAX 1
#endif

#ifndef BLIS_DEFAULT_NR_THREAD_MAX
#define BLIS_DEFAULT_NR_THREAD_MAX 4
#endif

The default value of 1 for BLIS_DEFAULT_MR_THREAD_MAX makes sense because the ir loop is seldom parallelized, since it only makes sense when the L1 data cache is shared among some group of cores.

However, I think BLIS_DEFAULT_NR_THREAD_MAX should also be set to 1. The reason is as follows: generally speaking, extracting parallelism from the jr loop is prescribed only when the L2 cache is shared among some group of cores. The most prevalent microarchitectures we use today are Haswell/Broadwell/Skylake/Kabylake, all of which allocate a private L2 to each core. And while there are examples of other systems where the L2 cache is shared, those cases are not widespread, and they can always be handled by overriding the defaults in the corresponding bli_family_<arch>.h file within the configuration.
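To illustrate the override mechanism, here is a minimal sketch. It assumes the family header is processed before bli_kernel_macro_defs.h, so that the #ifndef guards only fill in values the configuration left undefined. The value 4 and the shared-L2 scenario are illustrative, not taken from any real configuration.

```c
/* Hypothetical excerpt of a bli_family_<arch>.h for a configuration whose
   cores share an L2 cache. The value 4 is illustrative only. */
#define BLIS_DEFAULT_NR_THREAD_MAX 4

/* Guarded defaults, in the style of frame/include/bli_kernel_macro_defs.h.
   Only macros that are still undefined at this point receive a default. */
#ifndef BLIS_DEFAULT_MR_THREAD_MAX
#define BLIS_DEFAULT_MR_THREAD_MAX 1
#endif

#ifndef BLIS_DEFAULT_NR_THREAD_MAX
#define BLIS_DEFAULT_NR_THREAD_MAX 1
#endif
```

With this ordering, the ir cap falls back to the default of 1, while the jr cap keeps the value chosen by the configuration.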

So, my proposed change is simple:

#ifndef BLIS_DEFAULT_MR_THREAD_MAX
#define BLIS_DEFAULT_MR_THREAD_MAX 1
#endif

#ifndef BLIS_DEFAULT_NR_THREAD_MAX
#define BLIS_DEFAULT_NR_THREAD_MAX 1
#endif

Thus, unless overridden by the configuration, this would force all automatically obtained parallelism to the jc and ic loops.
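The effect of the two caps can be sketched with a simplified model of the offloading step. This is not the code in bli_rntm.c: the helper names, the struct, and the rule "give the inner loop the largest divisor of the outer loop's ways that fits under the cap" are assumptions made for illustration.

```c
/* Hypothetical helper (not a BLIS function): move parallelism from an outer
   loop to its inner loop. The inner loop receives the largest divisor of
   *outer_nt that is greater than 1 and does not exceed max_inner. */
static int offload_to_inner(int *outer_nt, int max_inner)
{
    for (int inner_nt = max_inner; inner_nt > 1; inner_nt--) {
        if (*outer_nt % inner_nt == 0) {
            *outer_nt /= inner_nt;
            return inner_nt;
        }
    }
    return 1; /* cap of 1, or no suitable divisor: nothing is offloaded */
}

/* Ways of parallelism for the four loops discussed in this issue. */
typedef struct { int jc, ic, jr, ir; } ways_t;

/* Start from a jc/ic split and apply the jr and ir caps. */
static ways_t apply_caps(int jc_nt, int ic_nt,
                         int nr_thread_max, int mr_thread_max)
{
    ways_t w = { jc_nt, ic_nt, 1, 1 };
    w.jr = offload_to_inner(&w.jc, nr_thread_max);
    w.ir = offload_to_inner(&w.ic, mr_thread_max);
    return w;
}
```

In this model, starting from jc_nt = 2 and ic_nt = 4, a jr cap of 4 moves both jc ways into jr (jc = 1, jr = 2), whereas a jr cap of 1 leaves the split at jc = 2, ic = 4.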

fgvanzee avatar Oct 09 '18 21:10 fgvanzee

It still makes sense to have JR_NT > 1 with private L2 caches in some (many?) cases. The increase in cache coherency traffic can be offset by the savings of sharing B.

That said, early experiments with setting up the algorithm to prefer m >= n instead of matching the microkernel preference (i.e. "ambidextrous" microkernels) show that JR_NT > 1 may be unnecessary in this case.

devinamatthews avatar Oct 09 '18 21:10 devinamatthews

> The increase in cache coherency traffic can be offset by the savings of sharing B.

Can you elaborate on the savings you're referring to here? The regime I envision would still involve ic_nt threads sharing each panel of B.

fgvanzee avatar Oct 09 '18 21:10 fgvanzee

You always get IC_NT independent panels of B...right?

devinamatthews avatar Oct 09 '18 21:10 devinamatthews

Er, I mean 'JC_NT'. But the amount of sharing is less.

devinamatthews avatar Oct 09 '18 21:10 devinamatthews

Let's assume ic_nt = 4 and jc_nt = 2. This results in two packed panels of B being created. Each panel of B would be shared across 4 threads.

fgvanzee avatar Oct 09 '18 21:10 fgvanzee

Right, instead of one panel shared across 8 threads if we had jr_nt = 2.
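The two scenarios being compared can be written down as a small counting model. It assumes one packed panel of B per jc way, shared by every thread that splits the ic, jr, and ir loops beneath it, and it ignores the pc loop. The type and function names are hypothetical, not BLIS code.

```c
/* Hypothetical model, not BLIS code: how many packed panels of B exist and
   how many threads share each one, for a given assignment of ways. */
typedef struct { int panels; int threads_per_panel; } b_sharing_t;

static b_sharing_t b_panel_sharing(int jc_nt, int ic_nt, int jr_nt, int ir_nt)
{
    b_sharing_t s;
    s.panels = jc_nt;                            /* one panel per jc way */
    s.threads_per_panel = ic_nt * jr_nt * ir_nt; /* threads below each jc way */
    return s;
}
```

With jc_nt = 2 and ic_nt = 4 this gives two panels shared by 4 threads each; with jc_nt = 1, ic_nt = 4, and jr_nt = 2 it gives one panel shared by all 8 threads.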

devinamatthews avatar Oct 09 '18 21:10 devinamatthews

I guess what I don't quite follow is this perceived benefit of threads "sharing" B.

fgvanzee avatar Oct 09 '18 21:10 fgvanzee

I am badly explaining what was told to me by @tlrmchlsmth some time ago. But the benchmark is the ultimate authority.

devinamatthews avatar Oct 09 '18 21:10 devinamatthews

FWIW, @dnparikh already has performance data for trsm on ThunderX2 (which has a private L2 cache) that shows a big difference between pushing all parallelism to the jr loop vs. splitting it between the jc and jr loops. (BLIS does not yet support ic parallelism in trsm, so we haven't been able to look at fiddling with ic_nt yet.)

I'll see if I can see any difference on my four-core Haswell.

fgvanzee avatar Oct 09 '18 21:10 fgvanzee

Yes I imagine relying on cache coherency performance on ARM is a problem.

devinamatthews avatar Oct 09 '18 22:10 devinamatthews

> Yes I imagine relying on cache coherency performance on ARM is a problem.

Can you elaborate?

fgvanzee avatar Oct 10 '18 17:10 fgvanzee

Intel has very fast communication between private caches through the ring bus (mesh network on newer chips). I highly doubt ARM can hold a candle to it.

devinamatthews avatar Oct 10 '18 17:10 devinamatthews

I couldn't gather any meaningful inferences from the data I collected on my Haswell workstation. (Not enough cores to play with.) Maybe I'm willing to punt on this issue for now.

@dnparikh I think we can achieve what we discussed for trsm (splitting parallelism between jc and jr loops) without this issue moving forward. Instead, we can build that logic directly into bli_rntm_set_ways_for_op().

fgvanzee avatar Oct 10 '18 17:10 fgvanzee

Also note that these settings are (or are supposed to be) per-configuration.

devinamatthews avatar Oct 10 '18 17:10 devinamatthews

I understand. This issue was always about changing the default values to values that would work as the best starting point.

fgvanzee avatar Oct 10 '18 17:10 fgvanzee