ex_cpu - hwloc - explicit NUMA / NUCA partitioning
Currently, ex_cpu detects Non-Uniform Cache Architecture (NUCA) as well as NUMA and optimizes work-stealing to reduce cross-cache and cross-node sharing. However, this is only done by deprioritizing distant nodes; sharing will still occur in many scenarios.
It would be useful to expose the NUMA/NUCA information to the user via an API, so that explicit partitions can be created that do not share memory at all. For example, a user may want to:
- partition the threads into 4 groups
- partition the threads according to shared L3 cache
- partition the threads into 1-thread-per-core
It should be possible to do this either in the same process (by creating multiple ex_cpu instances, each with a different partition index), or across multiple processes. In either case, the partition structure and the indexes used to access it must be stable across multiple invocations.
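A minimal sketch of what such an API could look like, using hypothetical names (`tmc::detect_partitions`, `tmc::partition_by`, `ex_cpu::set_partition` - none of these exist today). The key property is that partition descriptors are enumerated in a deterministic order, so the same index selects the same cores in every process and on every run:

```cpp
#include "tmc/ex_cpu.hpp"

int main() {
  // Hypothetical: enumerate hardware-derived partitions, e.g. one per
  // shared L3 cache, in a deterministic order (stable across processes
  // and invocations).
  auto parts = tmc::detect_partitions(tmc::partition_by::l3_cache);

  // Hypothetical: bind this executor's threads to partition index 2 only,
  // so it never shares cache or NUMA-local memory with an executor bound
  // to a different partition (possibly one created in another process).
  tmc::ex_cpu exec;
  exec.set_partition(parts[2]);
  exec.init();
  // ... submit work ...
  exec.teardown();
}
```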
Auto-partitioning would also be useful for scaling to different hardware - similar to the current work-stealing heuristic, but creating independent executors instead. The user could ask for partitioning by L3 cache, and receive back an initialized array of ex_cpu.
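A sketch of that auto-partitioning entry point, again with hypothetical names (`tmc::make_partitioned_executors`, `tmc::partition_by`):

```cpp
// Hypothetical: ask for one executor per L3 cache group and receive an
// initialized, indexable collection of ex_cpu back. The number of
// executors is discovered at runtime, so the same code adapts to
// different hardware.
auto execs = tmc::make_partitioned_executors(tmc::partition_by::l3_cache);
// execs[i] owns a disjoint set of cores that share a single L3 cache.
```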
It would also be helpful to implement an API to distribute work amongst an auto-partitioned (or manually partitioned) group of executors.
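For the work-distribution piece, a sketch under the same assumptions (an indexable group of executors as in the previous example; `submit_to` is a stand-in for whatever submission call is actually used):

```cpp
// Hypothetical round-robin distribution across a partitioned group of
// executors. A smarter policy could weight by partition size or load.
template <typename Group, typename Work>
void distribute(Group& execs, std::vector<Work>& work_items) {
  size_t next = 0;
  for (auto& w : work_items) {
    submit_to(execs[next % execs.size()], std::move(w)); // stand-in call
    ++next;
  }
}
```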
See the hwloc distances API (https://hwloc.readthedocs.io/en/stable/group__hwlocality__distances__get.html), which returns distance information for NUMA nodes, although it is known to be of little use for NUCA.
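For reference, a minimal hwloc sketch that queries that distance matrix. It uses only documented hwloc calls; note that it reports NUMA distances only - NUCA/cache grouping has to be inferred from the cache objects in the topology tree instead:

```cpp
#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  unsigned nr = 1; // in: array capacity, out: number of matrices returned
  struct hwloc_distances_s* dist = nullptr;
  // kind = 0, flags = 0: return whatever distance matrix is available
  if (hwloc_distances_get(topo, &nr, &dist, 0, 0) == 0 && nr > 0 && dist) {
    for (unsigned i = 0; i < dist->nbobjs; ++i) {
      for (unsigned j = 0; j < dist->nbobjs; ++j) {
        // values is a flattened nbobjs x nbobjs matrix (row-major)
        std::printf("%llu ",
                    (unsigned long long)dist->values[i * dist->nbobjs + j]);
      }
      std::printf("\n");
    }
    hwloc_distances_release(topo, dist);
  }
  hwloc_topology_destroy(topo);
}
```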