add method to query HW size
This is my first step towards addressing the oversubscription encountered in https://github.com/flame/blis/issues/588.
This adds a utility to figure out how many CPUs (hardware threads) are available in the affinity mask of the calling context. This is a stronger restriction than the number of available software threads.
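A minimal sketch of the idea on Linux (the function name is illustrative, not the actual BLIS API):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Count the HW threads (CPUs) present in the calling thread's
 * affinity mask; returns -1 if the mask cannot be queried. */
int query_hw_threads_in_mask( void )
{
    cpu_set_t mask;

    /* pid 0 means "the calling thread". */
    if ( sched_getaffinity( 0, sizeof( mask ), &mask ) != 0 )
        return -1;

    /* CPU_COUNT() yields the number of CPUs set in the mask, which
     * may be fewer than the number of CPUs in the system. */
    return CPU_COUNT( &mask );
}
```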
@egaudry
Signed-off-by: Jeff Hammond [email protected]
Open questions (not all need to be addressed in this PR):
- [ ] 1) What about Windows?
- HELP! @isuruf
- [ ] 2) What should happen when called in an OMP threaded region? (See the sketch after this list.)
  - Respect nested OMP settings (e.g. `omp_get_max_threads()`)?
  - Single-threaded?
  - Configurable?
- [ ] 3) What should happen when `n_threads > n_hw_cores`?
  - Abort? @devinamatthews does not like this.
  - Round-robin?
  - Dynamic?
  - Configurable?
- [ ] 4) When maintaining a pthreads thread pool (also TODO), also use the CPU mask from the user?
  - No easy way to know when we're in a user parallel region. Should the user be responsible for setting a CPU mask?
- [ ] 5) What if the user doesn't set the CPU mask appropriately when running multiple user threads? (Could potentially be addressed in pthreads mode.)
  - Threads for all HW cores could be created and pinned at initialization and "checked out" when needed.
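For question 2, a rough sketch of what "respect nested OMP settings" might look like (the function and the serialize-vs-cap policy are hypothetical):

```c
#include <omp.h>

/* Decide how many threads to actually use, given the number requested. */
int threads_to_use( int requested )
{
    if ( omp_in_parallel() )
    {
        /* Called from inside a user parallel region. If nested
         * parallelism is disabled, only one thread would be granted
         * anyway, so serialize. */
        if ( !omp_get_nested() ) return 1;

        /* Otherwise cap the request at what nested OMP would grant. */
        int avail = omp_get_max_threads();
        return ( requested < avail ? requested : avail );
    }
    return requested;
}
```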
Sorry @jeffhammond I hijacked the PR description to flesh out some ideas related to the larger problem of thread pinning.
To be clear, the final version will not abort. But I could not figure out how to set the BLIS threading variables correctly; that's why there's a preprocessor warning saying "please help me".
@jeffhammond once we're in the OMP region, it's kind of too late to do anything except, as in `bli_l3_thread_decorator_thread_check`, detect that we have only one thread when we expected more (e.g. if nested OMP is disabled). I think this check would have to come before the OMP region and the initialization of the global communicator.
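Roughly what I mean (modeled on that check, not copied from it):

```c
#include <omp.h>
#include <stdio.h>

/* After entering the parallel region, verify that OMP actually
 * granted the number of threads we asked for. */
void check_granted_threads( int n_expected )
{
    #pragma omp parallel num_threads( n_expected )
    {
        #pragma omp single
        {
            int n_actual = omp_get_num_threads();
            if ( n_actual != n_expected )
                fprintf( stderr, "expected %d threads, got %d "
                         "(nested OMP disabled?)\n",
                         n_expected, n_actual );
        }
    }
}
```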
@jeffhammond can you walk me through the logic of the PR as implemented? I'm not sure how the CPU mask alone can reliably be used to detect oversubscription.
I agree. We can do slightly better than serialization if nested is enabled but I don't think that's an important thing to spend time on.
The HW thread count from the affinity mask is compared to the user-specified SW thread count.
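A minimal sketch of that comparison (function names are hypothetical; see the affinity-mask sketch in the description):

```c
#include <stdio.h>
#include <stdlib.h>

/* From the earlier sketch: HW threads in the calling context's mask. */
extern int query_hw_threads_in_mask( void );

void check_oversubscription( int n_sw_threads )
{
    int n_hw = query_hw_threads_in_mask();

    if ( n_hw > 0 && n_sw_threads > n_hw )
    {
        fprintf( stderr, "BLIS: %d SW threads requested but only %d HW "
                 "threads in the affinity mask.\n", n_sw_threads, n_hw );
        abort(); /* WIP behavior only; the final version will not abort. */
    }
}
```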
Ah OK, so the user could in principle reduce the number of HW cores "visible" to BLIS by setting a mask but if no action is taken then BLIS just sees all cores. OK.
@jeffhammond sorry to be dense but I still don't see how this check fixes the oversubscription problem in #588. Each "instance" of BLIS would still pass the check, but the total number of threads is more than the number of cores. Detecting nested OMP and dropping down to one core could work, but would have to be configurable. Otherwise, maybe we could check `omp_get_max_threads()`?
I've asked @egaudry to test with his application, which may be setting affinity masks via MPI or HWLOC.
But I can say that it works to detect oversubscription for me. If I set OMP_NUM_THREADS=16 on an Intel CPU with 8 HW threads, it aborts. When I set OMP_NUM_THREADS=8, it's fine. I added a test that can do other things.
Yes, that should work fine, but setting `OMP_NUM_THREADS=BLIS_NUM_THREADS=4` with nesting would not abort and yet would oversubscribe. Isn't this the more pressing problem?
Here are my suggested solutions (a sketch of the first appears after this list):
- Maintain a process-wide atomic `int` which counts the number of active threads. Abort (or warn, etc.) if this exceeds the number of HW cores.
- Prepare a set of tokens, one for each HW core. Threads check out these tokens when created. Abort if no tokens are available. This approach can be extended to address thread pinning since each token can indicate a particular physical core.
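A rough sketch of the first idea using C11 atomics (all names are hypothetical):

```c
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int n_active_threads = 0;

/* Called when a worker thread starts. */
void thread_enter( int n_hw_cores )
{
    /* fetch_add returns the previous value, so prev + 1 is the
     * number of active threads including this one. */
    int prev = atomic_fetch_add( &n_active_threads, 1 );
    if ( prev + 1 > n_hw_cores )
        abort(); /* or warn, per the policy below */
}

/* Called when a worker thread finishes. */
void thread_exit( void )
{
    atomic_fetch_sub( &n_active_threads, 1 );
}
```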
Also, I suggest an environment variable `BLIS_OVERSUBSCRIBE`. The values `ABORT` and `WARN` would do what you expect, and any other value ignores oversubscription. Maybe also a `SERIALIZE` option? (A rough parsing sketch follows.)
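Roughly how I'd expect it to be parsed (the variable and the enum are part of the proposal, not existing BLIS API):

```c
#include <stdlib.h>
#include <string.h>

typedef enum { OVERSUB_IGNORE, OVERSUB_WARN, OVERSUB_ABORT } oversub_t;

oversub_t get_oversubscribe_policy( void )
{
    const char* s = getenv( "BLIS_OVERSUBSCRIBE" );

    if ( s == NULL )                 return OVERSUB_IGNORE;
    if ( strcmp( s, "ABORT" ) == 0 ) return OVERSUB_ABORT;
    if ( strcmp( s, "WARN"  ) == 0 ) return OVERSUB_WARN;

    /* Any other value ignores oversubscription. */
    return OVERSUB_IGNORE;
}
```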
The token idea doesn't work if BLIS runs alongside other things that use threads. But I'll see what I can do.
I disagree that nesting is the higher priority. The bug that motivated all of this was not because of nested OpenMP.
@egaudry @jeffhammond the issue in #588 was explained here. How is that not nested OMP?
Okay, it was nested OpenMP but also affinity. I can fix them both.
Sorry, I haven't had the time to check on my side.
@jeffhammond is correct that the issue was seen because an affinity mask was set. @devinamatthews is also correct that the issue originated from the oversubscription.
To clarify, the slowdown occurred because more threads than available physical cores were used in BLIS within a region protected by an OpenMP lock. This could also happen with oversubscription in the case of a nested OMP loop, with more threads being used than physically available cores.