add method to query HW size
This is my first step towards addressing the oversubscription encountered in https://github.com/flame/blis/issues/588.
This adds a utility to figure out how many CPUs (hardware threads) are available in the affinity mask of the calling context. This is a stronger restriction than the number of available software threads.
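A minimal sketch of the idea on Linux (the function name is illustrative, not the actual BLIS API):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Count the HW threads (CPUs) present in the calling thread's
 * affinity mask; returns -1 if the mask cannot be queried. */
int query_hw_threads_in_mask( void )
{
    cpu_set_t mask;

    /* pid 0 means "the calling thread". */
    if ( sched_getaffinity( 0, sizeof( mask ), &mask ) != 0 )
        return -1;

    /* CPU_COUNT() yields the number of CPUs set in the mask, which
     * may be fewer than the number of CPUs in the system. */
    return CPU_COUNT( &mask );
}
```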
@egaudry
Signed-off-by: Jeff Hammond [email protected]
Open questions (not all need to be addressed in this PR):
- [ ] 1) What about Windows?
- HELP! @isuruf
- [ ] 2) What should happen when called in an OMP threaded region? (See the sketch after this list.)
  - Respect nested OMP settings (e.g. `omp_get_max_threads()`)?
  - Single-threaded?
  - Configurable?
- [ ] 3) What should happen when `n_threads > n_hw_cores`?
  - Abort? @devinamatthews does not like this.
  - Round-robin?
  - Dynamic?
  - Configurable?
- [ ] 4) When maintaining a pthreads thread pool (also TODO), also use the CPU mask from the user?
  - No easy way to know when we're in a user parallel region. Should the user be responsible for setting a CPU mask?
- [ ] 5) What if the user doesn't set the CPU mask appropriately when running multiple user threads? (Could potentially be addressed in pthreads mode.)
  - Threads for all HW cores could be created and pinned at initialization and "checked out" when needed.
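For question 2, a rough sketch of what "respect nested OMP settings" might look like (the function and the serialize-vs-cap policy are hypothetical):

```c
#include <omp.h>

/* Decide how many threads to actually use, given the number requested. */
int threads_to_use( int requested )
{
    if ( omp_in_parallel() )
    {
        /* Called from inside a user parallel region. If nested
         * parallelism is disabled, only one thread would be granted
         * anyway, so serialize. */
        if ( !omp_get_nested() ) return 1;

        /* Otherwise cap the request at what nested OMP would grant. */
        int avail = omp_get_max_threads();
        return ( requested < avail ? requested : avail );
    }
    return requested;
}
```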
Sorry @jeffhammond I hijacked the PR description to flesh out some ideas related to the larger problem of thread pinning.
To be clear, the final version will not abort. But I could not figure out how to set the BLIS threading variables correctly; that's why there's a preprocessor warning saying "please help me".
@jeffhammond once we're in the OMP region, it's kind of too late to do anything except, as in `bli_l3_thread_decorator_thread_check`, detect that we have only one thread when we expected more (e.g. if nested OMP is disabled). I think this check would have to come before the OMP region and the initialization of the global communicator.
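Roughly what I mean (modeled on that check, not copied from it):

```c
#include <omp.h>
#include <stdio.h>

/* After entering the parallel region, verify that OMP actually
 * granted the number of threads we asked for. */
void check_granted_threads( int n_expected )
{
    #pragma omp parallel num_threads( n_expected )
    {
        #pragma omp single
        {
            int n_actual = omp_get_num_threads();
            if ( n_actual != n_expected )
                fprintf( stderr, "expected %d threads, got %d "
                         "(nested OMP disabled?)\n",
                         n_expected, n_actual );
        }
    }
}
```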
@jeffhammond can you walk me through the logic of the PR as implemented? I'm not sure how the CPU mask alone can reliably be used to detect oversubscription.
I agree. We can do slightly better than serialization if nested is enabled but I don't think that's an important thing to spend time on.
The HW thread count from the affinity mask is compared to the user-specified SW thread count.
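A minimal sketch of that comparison (function names are hypothetical; see the affinity-mask sketch in the description):

```c
#include <stdio.h>
#include <stdlib.h>

/* From the earlier sketch: HW threads in the calling context's mask. */
extern int query_hw_threads_in_mask( void );

void check_oversubscription( int n_sw_threads )
{
    int n_hw = query_hw_threads_in_mask();

    if ( n_hw > 0 && n_sw_threads > n_hw )
    {
        fprintf( stderr, "BLIS: %d SW threads requested but only %d HW "
                 "threads in the affinity mask.\n", n_sw_threads, n_hw );
        abort(); /* WIP behavior only; the final version will not abort. */
    }
}
```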
Ah OK, so the user could in principle reduce the number of HW cores "visible" to BLIS by setting a mask but if no action is taken then BLIS just sees all cores. OK.
@jeffhammond sorry to be dense but I still don't see how this check fixes the oversubscription problem in #588. Each "instance" of BLIS would still pass the check, but the total number of threads is more than the number of cores. Detecting nested OMP and dropping down to one core could work, but would have to be configurable. Otherwise, maybe we could check `omp_get_max_threads()`?
I've asked @egaudry to test with his application, which may be setting affinity masks via MPI or HWLOC.
But I can say that it works to detect oversubscription for me. If I set OMP_NUM_THREADS=16 on an Intel CPU with 8 HW threads, it aborts. When I set OMP_NUM_THREADS=8, it's fine. I added a test that can do other things.
Yes, that should work fine, but setting `OMP_NUM_THREADS=BLIS_NUM_THREADS=4` with nesting would not abort and yet would oversubscribe. Isn't this the more pressing problem?
Here are my suggested solutions (a sketch of the first appears after this list):
- Maintain a process-wide atomic `int` which counts the number of active threads. Abort (or warn, etc.) if this exceeds the number of HW cores.
- Prepare a set of tokens, one for each HW core. Threads check out these tokens when created. Abort if no tokens are available. This approach can be extended to address thread pinning since each token can indicate a particular physical core.
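A rough sketch of the first idea using C11 atomics (all names are hypothetical):

```c
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int n_active_threads = 0;

/* Called when a worker thread starts. */
void thread_enter( int n_hw_cores )
{
    /* fetch_add returns the previous value, so prev + 1 is the
     * number of active threads including this one. */
    int prev = atomic_fetch_add( &n_active_threads, 1 );
    if ( prev + 1 > n_hw_cores )
        abort(); /* or warn, per the policy below */
}

/* Called when a worker thread finishes. */
void thread_exit( void )
{
    atomic_fetch_sub( &n_active_threads, 1 );
}
```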
Also, I suggest an environment variable `BLIS_OVERSUBSCRIBE`. The values `ABORT` and `WARN` would do what you expect, and any other value ignores oversubscription. Maybe also a `SERIALIZE` option? (A rough parsing sketch follows.)
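Roughly how I'd expect it to be parsed (the variable and the enum are part of the proposal, not existing BLIS API):

```c
#include <stdlib.h>
#include <string.h>

typedef enum { OVERSUB_IGNORE, OVERSUB_WARN, OVERSUB_ABORT } oversub_t;

oversub_t get_oversubscribe_policy( void )
{
    const char* s = getenv( "BLIS_OVERSUBSCRIBE" );

    if ( s == NULL )                 return OVERSUB_IGNORE;
    if ( strcmp( s, "ABORT" ) == 0 ) return OVERSUB_ABORT;
    if ( strcmp( s, "WARN"  ) == 0 ) return OVERSUB_WARN;

    /* Any other value ignores oversubscription. */
    return OVERSUB_IGNORE;
}
```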
The token idea doesn't work if BLIS runs alongside other things that use threads. But I'll see what I can do.
I disagree that nesting is the higher priority. The bug that motivated all of this was not because of nested OpenMP.
@egaudry @jeffhammond the issue in #588 was explained here. How is that not nested OMP?
Okay, it was nested OpenMP but also affinity. I can fix them both.
Sorry, I haven't had the time to check on my side.
@jeffhammond is correct that the issue was seen because an affinity mask was set. @devinamatthews is also correct that the issue originated from the oversubscription.
To clarify, the slowdown occurred because more threads than available physical cores were used in BLIS within a region protected by an OpenMP lock. This could also happen with oversubscription in the case of a nested OMP loop, with more threads being used than physically available cores.