NUMA issues
Description
PaRSEC deals with NUMA machines by ignoring any NUMA issues. This leads to sub-optimal performance on nearly all modern architectures, especially on larger machines with many NUMA zones, e.g. clusters of dual-socket AMD Rome nodes. Current workarounds include:
- Ignoring it, taking the efficiency hit
- `numactl --interleave` or other `numactl` options
- Running on only one socket or NUMA zone (per caffeine break discussion with @devreal)
- Running one process per socket or NUMA zone and relying on the communication engine to deal with intra-node communication
  - This in particular changes how the application runs and how the problem is distributed
  - More imbalance, since tasks cannot be stolen between different processes
- Attempting to use PaRSEC's Virtual Process infrastructure
  - This is followed by the realization that it hasn't been used in years
  - It might have suffered bitrot
  - It doesn't really solve any NUMA issues
PaRSEC should be able to deal with NUMA machines in a smarter way. This issue exists to discuss the various issues encountered with NUMA, why the above solutions and workarounds do not work, and determine a way forward.
Details of current NUMA issues
Memory allocation
PaRSEC will allocate memory for different purposes. Broadly, I'll split these into three categories:
- Internal PaRSEC data structures, e.g. `parsec_task_t`
- Task data, data used for flows
- Communication buffers
Accesses to internal data structures matter for e.g. scheduling latencies, but it is often not obvious where these structures should be placed, and in any case accessing them should not constitute a significant portion of wall time.
Placement of task data and communication buffers is, however, more significant, and the right location is also more obvious: task data ought to be placed close to where the tasks using it will execute, and communication buffers close to the communication thread (and the network hardware); a sketch of such placement follows the list below. There are, however, some complicating factors:
- Tasks can be stolen and thereby execute on cores far from where their data is located; in general this isn't an issue, since if tasks must be stolen, the core has run out of work and executing a task sub-optimally is better than deferring its execution
- When data is not sent eagerly, the buffers used for communication and for task data are one and the same
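To make the intended placement policy concrete, here is a minimal sketch (my own illustration, not existing PaRSEC code) using hwloc, which PaRSEC already relies on, to bind an allocation to the NUMA node(s) local to the core a task is expected to run on. The helper name `alloc_near_pu` and the fixed PU index are hypothetical; this assumes the hwloc 2.x memory-binding API and links with `-lhwloc`.

```c
#include <hwloc.h>
#include <stdio.h>

/* Hypothetical helper (not a PaRSEC API): allocate `size` bytes bound to
 * the NUMA node(s) local to the PU with OS index `pu_os_index`, i.e. close
 * to where the task that owns the data is expected to execute. */
static void *alloc_near_pu(hwloc_topology_t topo, unsigned pu_os_index,
                           size_t size)
{
    hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, pu_os_index);
    if (pu == NULL)
        return hwloc_alloc(topo, size);   /* fall back to the default policy */

    /* hwloc 2.x: bind the allocation to the NUMA node(s) covering this PU. */
    return hwloc_alloc_membind(topo, size, pu->nodeset,
                               HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    size_t size = 1 << 20;                        /* 1 MiB of "task data" */
    void *task_data = alloc_near_pu(topo, 0, size);
    printf("task data allocated at %p\n", task_data);

    if (task_data != NULL)
        hwloc_free(topo, task_data, size);
    hwloc_topology_destroy(topo);
    return 0;
}
```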
Scheduling of tasks from remote activations
Currently, all tasks spawned from the communication thread when remote activations are received are pushed to the scheduling queue of thread 0. There are several issues with this:
- Depending on the scheduler implementation, this can overload one core with tasks and underload the others, as well as causing (potentially many) tasks to overflow onto the system queue
- The default "scheduling location" for all such tasks is thread 0; depending on where the tasks' memory is allocated this is not ideal, particularly when compounded with the potential for frequent overflows to the system queue
- Thread 0 may be located on a different NUMA zone than the communication thread
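One possible mitigation, sketched below purely as an illustration (the queue and task types are simple stand-ins, not PaRSEC's actual structures), is for the communication thread to push tasks spawned by remote activations onto per-NUMA-zone queues, preferring the zone where the task's data lives and falling back to round-robin instead of always targeting thread 0.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_NUMA_ZONES 4   /* e.g. one zone per Rome NPS4 quadrant */

/* Stand-ins for the runtime's real task and queue types. */
typedef struct { int id; int home_numa_zone; } task_t;
typedef struct { task_t *slots[1024]; _Atomic size_t tail; } ready_queue_t;

static ready_queue_t zone_queue[NUM_NUMA_ZONES];
static _Atomic unsigned next_zone = 0;

static void queue_push(ready_queue_t *q, task_t *t)
{
    size_t idx = atomic_fetch_add(&q->tail, 1) % 1024;
    q->slots[idx] = t;   /* a real queue would handle overflow, not wrap */
}

/* Called by the communication thread for each task spawned by a remote
 * activation.  If the task's data already has a known home NUMA zone,
 * target that zone's queue; otherwise round-robin across zones instead of
 * always pushing to thread 0's queue. */
static void schedule_remote_task(task_t *t)
{
    unsigned zone;
    if (t->home_numa_zone >= 0)
        zone = (unsigned)t->home_numa_zone % NUM_NUMA_ZONES;
    else
        zone = atomic_fetch_add(&next_zone, 1) % NUM_NUMA_ZONES;
    queue_push(&zone_queue[zone], t);
}

int main(void)
{
    task_t tasks[8];
    for (int i = 0; i < 8; i++) {
        tasks[i].id = i;
        tasks[i].home_numa_zone = (i % 2) ? i % NUM_NUMA_ZONES : -1;
        schedule_remote_task(&tasks[i]);
    }
    for (int z = 0; z < NUM_NUMA_ZONES; z++)
        printf("zone %d received %zu tasks\n", z, atomic_load(&zone_queue[z].tail));
    return 0;
}
```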
NUMA-aware scheduling
None of the current schedulers are truly NUMA-aware, in the sense of attempting to place tasks that share data close to each other.
- LFQ, LTQ, and PBQ search for stealable tasks in the queues of other threads in order of the hwloc hierarchy, which gives them an implicit sort of NUMA-awareness.
- LTQ attempts to group tasks with the same inputs together into heaps, but this only works if all the tasks can be scheduled together in the same call i.e. all their inputs are ready.
- LHQ has a hwloc-ordered hierarchy of queues.
This is a difficult problem with no obvious solution. There is, I believe, a wealth of literature on the topic, which should be examined. One suggestion by @bosilca was to utilize a helper thread looking for tasks to aggregate.
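Whatever approach is taken, a NUMA-aware scheduler first needs to know where a task's data currently resides. On Linux this can be queried without migrating anything by passing NULL as the target-nodes argument to move_pages(2); the helper below is only a sketch (`numa_node_of_buffer` is a hypothetical name, not a PaRSEC or libnuma function) returning the node holding the first page of a buffer. Link with `-lnuma`.

```c
#include <numa.h>
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: return the NUMA node currently holding the first
 * page of `buf`, or -1 on error.  Passing NULL as the `nodes` argument to
 * move_pages() queries residency instead of migrating pages. */
static int numa_node_of_buffer(void *buf)
{
    long page = sysconf(_SC_PAGESIZE);
    void *first_page = (void *)((uintptr_t)buf & ~((uintptr_t)page - 1));
    int status = -1;
    if (move_pages(0 /* self */, 1, &first_page, NULL, &status, 0) != 0)
        return -1;
    return status;   /* >= 0: node id; negative: -errno for that page */
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma is not available on this system\n");
        return 1;
    }
    size_t size = 1 << 20;
    char *buf = malloc(size);
    buf[0] = 1;   /* touch so the first page is actually backed */
    printf("first page of buf resides on node %d\n", numa_node_of_buffer(buf));
    free(buf);
    return 0;
}
```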
Current workarounds
Ignoring the problem
"La la la la I can't hear you"
numactl --interleave
This uses `numactl` to force all* memory pages to be interleaved between the different NUMA zones. This can improve performance in some cases, but might make others worse, depending on where and how allocations occur.
* There might be some cases where some memory is not interleaved, TBD.
Use only one socket/NUMA zone
This technically prevents NUMA issues, but only by ignoring half (or more!) of the computational resources available; this is not a viable solution.
One process per socket/NUMA zone
This is likely the most common current solution. This is problematic for several reasons:
- Duplication of on-node shared resources
- Dependence on the communication engine for shared-memory transfers
  - Usually not a problem with MPI
  - Not (currently) supported by LCI; we made the assumption when designing and implementing the interface that runtimes prefer to run with one process per node
- No inter-process, intra-node task stealing
- Changes the data distribution and task execution flow
Solutions
Hope NUMA goes away
This seems unlikely, and increasingly less likely in future machines.
Improve one-process-per-NUMA
Addressing some of the deficiencies of the above solution is possible, albeit with no small amount of engineering effort. Some data structures could be placed in memory shared between the two (or more) processes; doing so could allow sharing the communication thread and enable cross-process task stealing. Using shared memory for intra-node communication, or directly attaching memory from sibling processes, is also a possibility. However, all of these are rather complex solutions to the problem.
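As an illustration of the shared-memory direction (the region name, layout, and counter semantics below are invented for the example, not an existing PaRSEC mechanism), sibling processes on a node could map a small POSIX shared-memory region holding, say, per-zone ready-task counters that a would-be thief can inspect:

```c
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative layout for state shared between the processes on one node,
 * e.g. a cross-process ready-task counter per NUMA zone that a stealing
 * process could inspect.  The name and fields are hypothetical. */
typedef struct {
    _Atomic long ready_tasks[8];   /* one slot per NUMA zone / process */
} node_shared_state_t;

int main(void)
{
    const char *name = "/parsec_node_state_demo";   /* hypothetical name */

    /* The first process creates the region; later ones simply map it.
     * Link with -lrt on older glibc. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(node_shared_state_t)) != 0) {
        perror("ftruncate"); return 1;
    }

    node_shared_state_t *state =
        mmap(NULL, sizeof(*state), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (state == MAP_FAILED) { perror("mmap"); return 1; }

    /* Each process advertises work in its own slot; siblings can read it
     * to decide whether cross-process stealing is worthwhile. */
    atomic_fetch_add(&state->ready_tasks[0], 42);
    printf("zone 0 advertises %ld ready tasks\n",
           atomic_load(&state->ready_tasks[0]));

    munmap(state, sizeof(*state));
    close(fd);
    shm_unlink(name);   /* cleanup for the demo; real code would coordinate */
    return 0;
}
```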
Improving and redesigning Virtual Processes
My understanding is that currently, VPs do one thing and one thing only: prevent tasks from being stolen between different VPs. This is not what I would expect nor what I think most users desire. Given a clean slate, I'd imagine Virtual Processes thus:
- An abstraction for a "memory [and IO] space"
- Have separate task queues
  - However, if the queues for one VP are empty, they should attempt to steal from another VP
- Arranged in a hierarchy (see the sketch after this list)
  - Machine hierarchies are only growing more complex
  - e.g. an AMD Rome system might have two sockets, each of which has four NUMA zones, each of which has four L3 cache slices, each of which has four cores
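To make the hierarchy idea concrete, here is a sketch of what such a tree of Virtual Processes might look like as a data structure. This is purely illustrative, with hypothetical names, and is not a proposal for the actual PaRSEC API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hierarchical Virtual Process descriptor: each level of the
 * machine (machine, socket, NUMA zone, L3 slice, core) is a node in a
 * tree.  A core whose queues run dry would first steal from siblings under
 * the same parent, then climb one level and try again. */
typedef struct vp_node {
    const char       *level;        /* e.g. "socket", "numa", "l3", "core" */
    int               index;        /* position among siblings             */
    struct vp_node   *parent;
    struct vp_node  **children;
    int               num_children;
    /* a per-VP task queue would live here */
} vp_node_t;

static vp_node_t *vp_build(const char **levels, const int *fanout, int depth,
                           int index, vp_node_t *parent)
{
    vp_node_t *vp = calloc(1, sizeof(*vp));
    vp->level = levels[0];
    vp->index = index;
    vp->parent = parent;
    if (depth > 1) {
        vp->num_children = fanout[0];
        vp->children = calloc(fanout[0], sizeof(*vp->children));
        for (int i = 0; i < fanout[0]; i++)
            vp->children[i] = vp_build(levels + 1, fanout + 1, depth - 1, i, vp);
    }
    return vp;
}

static int count_leaves(const vp_node_t *vp)
{
    if (vp->num_children == 0)
        return 1;
    int n = 0;
    for (int i = 0; i < vp->num_children; i++)
        n += count_leaves(vp->children[i]);
    return n;
}

int main(void)
{
    /* The Rome-like example from the list above: 2 sockets, each with 4
     * NUMA zones, each with 4 L3 slices, each with 4 cores. */
    const char *levels[] = { "machine", "socket", "numa", "l3", "core" };
    const int   fanout[] = { 2, 4, 4, 4 };
    vp_node_t *root = vp_build(levels, fanout, 5, 0, NULL);
    printf("hierarchy has %d core-level VPs\n", count_leaves(root));
    /* The tree is deliberately leaked; this is only a demo. */
    return 0;
}
```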
Allocating memory from a particular NUMA zone is possible on Linux via libnuma; see the numa(3) man page. I believe that other targeted systems offer similar mechanisms.
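For reference, a minimal example of node-targeted allocation with libnuma (link with `-lnuma`; error handling trimmed, and the choice of node here is arbitrary):

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = numa_max_node();    /* pick the last node as an example */
    size_t size = 4UL << 20;       /* 4 MiB */

    /* Allocate physical pages on a specific NUMA node, e.g. the zone where
     * the tasks consuming this data are expected to run. */
    void *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0, size);          /* touch to actually back the pages */

    printf("allocated %zu bytes on NUMA node %d\n", size, node);
    numa_free(buf, size);
    return 0;
}
```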
By allowing the runtime to choose the initial placement for a task and its associated data, it may be possible to speed up task execution. Up for debate is where incoming task data from remote nodes ought to be placed, as there are arguments both for placing it close to the network and close to where it will be executed; this is possibly system-dependent and might require some tuning.
Other thoughts
This is an initial description of some of the problems encountered with NUMA and PaRSEC, their current workarounds, and future solutions. Others are very welcome to chime in with their thoughts, experiences, and suggestions, as I have almost definitely forgotten something or simply don't know about it.
Thanks!