
NUMA issues

Open · omor1 opened this issue 2 years ago · 1 comment

Description

PaRSEC deals with NUMA machines by ignoring any NUMA issues. This leads to sub-optimal performance on nearly all modern architectures, especially on larger machines with many NUMA zones, e.g. clusters with dual-socket AMD Rome nodes. Current workarounds include:

  • Ignoring it, taking the efficiency hit
  • numactl --interleave or other numactl options
  • Running on one socket or NUMA zone (per caffeine break discussion with @devreal)
  • Running one process per socket or NUMA zone and relying on the communication engine to deal with intra-node communication
    • This in particular changes how the application runs and how the problem is distributed
    • More imbalance, since tasks cannot be stolen between different processes
  • Attempt to use PaRSEC's Virtual Process infrastructure
    • This is followed by the realization that it hasn't been used in years
    • It might have suffered bitrot
    • Doesn't really solve any NUMA issues

PaRSEC should be able to deal with NUMA machines in a smarter way. This issue exists to discuss the various issues encountered with NUMA, why the above solutions and workarounds do not work, and determine a way forward.

Details of current NUMA issues

Memory allocation

PaRSEC will allocate memory for different purposes. Broadly, these fall into three categories:

  1. Internal PaRSEC data structures e.g. parsec_task_t
  2. Task data, data used for flows
  3. Communication buffers

Accesses to internal data structures matter for e.g. scheduling latency, but it is often not obvious where these structures should be placed, and in any case they should not account for a significant portion of wall time.

Placement of task data and communication buffers is, however, more significant, and their preferred location is also more obvious: task data ought to be placed close to where the tasks using it will execute, and communication buffers close to the communication thread (and the network hardware). A rough sketch of explicit placement follows the list below. There are, however, some complicating factors:

  1. Tasks can be stolen and thereby execute on cores far from where their data is located; in general this isn't a problem, since if a task must be stolen, the stealing core has run out of work and executing the task sub-optimally is better than deferring its execution
  2. When data is not sent eagerly, the buffers used for communication and for task data are one and the same
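
As a concrete illustration of placing task data close to its consumer, here is a minimal sketch (not existing PaRSEC code) that allocates a buffer bound to the NUMA node local to the core expected to execute the task. It assumes hwloc >= 2.0; the function name and parameters are hypothetical.

```c
/* Minimal sketch (not PaRSEC API): allocate a task-data buffer bound to the
 * NUMA node(s) local to the core expected to execute the task.
 * Assumes hwloc >= 2.0.  Release the buffer with hwloc_free(topo, ptr, len). */
#include <hwloc.h>

void *alloc_task_data_near_core(hwloc_topology_t topo, unsigned pu_os_index,
                                size_t len)
{
    hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, pu_os_index);
    if (pu == NULL)
        return NULL;
    /* pu->nodeset describes the NUMA node(s) local to that core */
    return hwloc_alloc_membind(topo, len, pu->nodeset,
                               HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
}
```

Complication 1 above still applies: if the task ends up stolen anyway, the binding is simply wrong, and the runtime has to decide whether that (hopefully rare) case outweighs the benefit.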

Scheduling of tasks from remote activations

Currently, all tasks spawned from the communication thread when remote activations are received are pushed to the scheduling queue of thread 0. There are several issues with this:

  1. Depending on the scheduler implementation, this can overload one core with tasks and underload the others, as well as cause (potentially many) tasks to overflow onto the system queue
  2. The default "scheduling location" for all such tasks is thread 0; depending on where the tasks' memory is allocated this is not ideal, particularly when compounded with the potential for frequent overflows to the system queue
  3. Thread 0 may be located in a different NUMA zone than the communication thread (see the hwloc sketch below)
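
To make the third point actionable, the sketch below shows how the communication thread could discover its own NUMA zone with standard hwloc calls, so that tasks spawned from remote activations could be pushed to a worker queue in that zone (or in the zone holding the task's data) rather than always to thread 0. The routing policy itself is an assumption on my part, not current PaRSEC behavior.

```c
/* Sketch: return the OS index of the NUMA node the calling thread (e.g. the
 * communication thread) last ran on, or -1 if it cannot be determined. */
#include <hwloc.h>

int numa_node_of_calling_thread(hwloc_topology_t topo)
{
    int node = -1;
    hwloc_bitmap_t cpus  = hwloc_bitmap_alloc();
    hwloc_bitmap_t nodes = hwloc_bitmap_alloc();
    /* where was this thread last scheduled? */
    if (hwloc_get_last_cpu_location(topo, cpus, HWLOC_CPUBIND_THREAD) == 0) {
        hwloc_cpuset_to_nodeset(topo, cpus, nodes);  /* cpuset -> NUMA nodeset */
        node = hwloc_bitmap_first(nodes);            /* first matching node */
    }
    hwloc_bitmap_free(cpus);
    hwloc_bitmap_free(nodes);
    return node;
}
```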

NUMA-aware scheduling

None of the current schedulers are truly NUMA-aware, in the sense of attempting to place tasks that share data close to each other.

  • LFQ, LTQ, and PBQ search for stealable tasks in the queues of other threads in order of the hwloc hierarchy, which gives them an implicit sort of NUMA-awareness (see the proximity-ordering sketch below).
  • LTQ attempts to group tasks with the same inputs together into heaps, but this only works if all the tasks can be scheduled together in the same call, i.e. all their inputs are ready.
  • LHQ has a hwloc-ordered hierarchy of queues.

This is a difficult problem with no obvious solution. There is, I believe, a wealth of literature on the topic, which should be examined. One suggestion by @bosilca was to utilize a helper thread looking for tasks to aggregate.
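
For reference, the implicit proximity ordering mentioned in the first bullet amounts to something like the sketch below: walk up the hwloc tree from the current core and visit progressively more distant cores (same L3 slice, then same NUMA zone, then same socket, then the rest). This is only an illustration of the idea, not the actual LFQ/LTQ/PBQ code, and try_steal_from_pu is a placeholder.

```c
/* Sketch of hwloc-proximity-ordered victim search.  try_steal_from_pu is a
 * placeholder for "inspect that core's queue", not a PaRSEC function. */
#include <hwloc.h>

extern int try_steal_from_pu(unsigned pu_os_index);  /* placeholder */

int steal_by_proximity(hwloc_obj_t self_pu)
{
    hwloc_bitmap_t visited = hwloc_bitmap_alloc();
    hwloc_bitmap_set(visited, self_pu->os_index);

    /* each ancestor's cpuset covers a larger, more distant group of cores */
    for (hwloc_obj_t anc = self_pu->parent; anc != NULL; anc = anc->parent) {
        unsigned pu;
        hwloc_bitmap_foreach_begin(pu, anc->cpuset)
            if (!hwloc_bitmap_isset(visited, pu)) {
                hwloc_bitmap_set(visited, pu);
                if (try_steal_from_pu(pu)) {
                    hwloc_bitmap_free(visited);
                    return 1;  /* found work */
                }
            }
        hwloc_bitmap_foreach_end();
    }
    hwloc_bitmap_free(visited);
    return 0;  /* nothing stealable anywhere */
}
```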

Current workarounds

Ignoring the problem

"La la la la I can't hear you"

numactl --interleave

This uses numactl to force all* memory pages to be interleaved across the different NUMA zones. This can improve performance in some cases, but can make others worse, depending on where and how allocations occur.

* There might be some cases where some memory is not interleaved, TBD.

Use only one socket/NUMA zone

This technically prevents NUMA issues, but only by ignoring half (or more!) of the computational resources available; this is not a viable solution.

One process per socket/NUMA zone

This is likely the most common current solution. This is problematic for several reasons:

  1. Duplication of on-node shared resources
  2. Dependence on communication engine for shared memory transfers
    • Usually not a problem with MPI
    • Not (currently) supported by LCI; when designing and implementing the interface, we assumed that runtimes prefer to run one process per node
  3. No inter-process, intra-node task stealing
  4. Changes data distribution and task execution flow

Solutions

Hope NUMA goes away

This seems unlikely, and increasingly less likely in future machines.

Improve one-process-per-NUMA

Improving some of the deficiencies of the above solution is possible, albeit with no small amount of engineering effort. Some data structures could be placed in memory shared between the two (or more) processes; doing so could allow sharing the communication thread and enable cross-process task stealing. Using shared memory for intra-node communication, or directly attaching memory from sibling processes, is also a possibility. However, all of these are very complex solutions to the problem.

Improving and redesigning Virtual Processes

My understanding is that currently, VPs do one thing and one thing only: prevent tasks from being stolen between different VPs. This is not what I would expect, nor what I think most users desire. Given a clean slate, I'd imagine Virtual Processes thus (a rough data-structure sketch follows the list):

  1. An abstraction for a "memory [and IO] space"
  2. Have separate task queues
    • However, if the queues of one VP are empty, its threads should attempt to steal from another VP
  3. Arranged in a hierarchy
    • Machine hierarchies are only growing more complex
    • e.g. an AMD Rome system might have two sockets
      • Each of which has four NUMA zones
        • Each of which has four L3 cache slices
          • Each of which has four cores
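
To make that concrete, a Virtual Process descriptor in such a redesign might look roughly like the structure below. This is purely a sketch of the hierarchy idea; none of these names exist in PaRSEC today.

```c
/* Hypothetical sketch of a hierarchical Virtual Process descriptor. */
typedef struct vp_node_s {
    struct vp_node_s  *parent;       /* NULL at the machine root             */
    struct vp_node_s **children;     /* e.g. sockets -> NUMA zones -> L3s    */
    unsigned           nb_children;
    int                numa_node;    /* OS index, or -1 above the NUMA level */
    void              *task_queues;  /* per-level queues (placeholder type)  */
} vp_node_t;
```

Threads would push into the queues of their own leaf, steal from siblings first, and only then climb toward the root, which matches point 2 above.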

Allocating memory from a particular NUMA zone is possible on Linux via libnuma; see numa(3). I believe that the other systems we target offer similar mechanisms.
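
A minimal allocation sketch with libnuma (link with -lnuma) might look like the following; the wrapper name is mine, not an existing API.

```c
/* Minimal libnuma allocation sketch; see numa(3).  Falls back to malloc()
 * when the kernel has no NUMA support.  Release with numa_free(ptr, size)
 * (or free() on the fallback path). */
#include <numa.h>
#include <stdlib.h>

void *alloc_on_node(size_t size, int node)
{
    if (numa_available() < 0)              /* no NUMA support */
        return malloc(size);
    return numa_alloc_onnode(size, node);  /* pages bound to the given node */
}
```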

By allowing the runtime to choose the initial placement for a task and its associated data, it may be possible to speed up task execution. Up for debate is where incoming task data from remote nodes ought to be placed, as there are arguments both for placing it close to the network and close to where it will be executed; this is possibly system-dependent and might require some tuning.

Other thoughts

This is an initial description of some of the problems encountered with NUMA and PaRSEC, their current workarounds, and future solutions. Others are very welcome to chime in with their thoughts, experiences, and suggestions, as I have almost definitely forgotten something or simply don't know about it.

Thanks!

omor1 · Jun 23 '22, 22:06