ex_cpu - hwloc - explicit NUMA / NUCA partitioning
Currently, ex_cpu detects Non-Uniform Cache Architecture (NUCA) as well as NUMA and optimizes work-stealing to reduce cross-cache and cross-node sharing. However, this is only done by deprioritizing distant nodes; sharing will still occur in many scenarios.
It would be useful to expose the NUMA/NUCA information to the user via an API, so that explicit partitions can be created that do not share memory at all. For example, a user may want to:
- partition the threads into 4 groups
- partition the threads according to shared L3 cache
- partition the threads into 1-thread-per-core
It should be possible to do this either in the same process (by creating multiple ex_cpu instances, each with a different partition index), or across multiple processes. In either case, the partition structure and the indexes used to access it must be stable across multiple invocations.
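A minimal sketch of what such an API could look like, using hypothetical names (`tmc::detect_partitions`, `tmc::partition_by`, `ex_cpu::set_partition` - none of these exist today). The key property is that partition descriptors are enumerated in a deterministic order, so the same index selects the same cores in every process and on every run:

```cpp
#include "tmc/ex_cpu.hpp"

int main() {
  // Hypothetical: enumerate hardware-derived partitions, e.g. one per
  // shared L3 cache, in a deterministic order (stable across processes
  // and invocations).
  auto parts = tmc::detect_partitions(tmc::partition_by::l3_cache);

  // Hypothetical: bind this executor's threads to partition index 2 only,
  // so it never shares cache or NUMA-local memory with an executor bound
  // to a different partition (possibly one created in another process).
  tmc::ex_cpu exec;
  exec.set_partition(parts[2]);
  exec.init();
  // ... submit work ...
  exec.teardown();
}
```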
Auto-partitioning would also be useful for scaling to different hardware - similar to the current work-stealing heuristic, but creating independent executors instead. The user could ask for partitioning by L3 cache, and receive back an initialized array of ex_cpu.
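A sketch of that auto-partitioning entry point, again with hypothetical names (`tmc::make_partitioned_executors`, `tmc::partition_by`):

```cpp
// Hypothetical: ask for one executor per L3 cache group and receive an
// initialized, indexable collection of ex_cpu back. The number of
// executors is discovered at runtime, so the same code adapts to
// different hardware.
auto execs = tmc::make_partitioned_executors(tmc::partition_by::l3_cache);
// execs[i] owns a disjoint set of cores that share a single L3 cache.
```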
It would also be helpful to implement an API to distribute work amongst an auto-partitioned (or manually partitioned) group of executors.
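For the work-distribution piece, a sketch under the same assumptions (an indexable group of executors as in the previous example; `submit_to` is a stand-in for whatever submission call is actually used):

```cpp
// Hypothetical round-robin distribution across a partitioned group of
// executors. A smarter policy could weight by partition size or load.
template <typename Group, typename Work>
void distribute(Group& execs, std::vector<Work>& work_items) {
  size_t next = 0;
  for (auto& w : work_items) {
    submit_to(execs[next % execs.size()], std::move(w)); // stand-in call
    ++next;
  }
}
```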
See the hwloc distances API (https://hwloc.readthedocs.io/en/stable/group__hwlocality__distances__get.html), which returns distance information for NUMA nodes, although it is known to be of little use for NUCA.
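For reference, a minimal hwloc sketch that queries that distance matrix. It uses only documented hwloc calls; note that it reports NUMA distances only - NUCA/cache grouping has to be inferred from the cache objects in the topology tree instead:

```cpp
#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  unsigned nr = 1; // in: array capacity, out: number of matrices returned
  struct hwloc_distances_s* dist = nullptr;
  // kind = 0, flags = 0: return whatever distance matrix is available
  if (hwloc_distances_get(topo, &nr, &dist, 0, 0) == 0 && nr > 0 && dist) {
    for (unsigned i = 0; i < dist->nbobjs; ++i) {
      for (unsigned j = 0; j < dist->nbobjs; ++j) {
        // values is a flattened nbobjs x nbobjs matrix (row-major)
        std::printf("%llu ",
                    (unsigned long long)dist->values[i * dist->nbobjs + j]);
      }
      std::printf("\n");
    }
    hwloc_distances_release(topo, dist);
  }
  hwloc_topology_destroy(topo);
}
```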