llama.cpp
Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance
Feature Description
The current llama.cpp implementation does not make optimal use of NUMA architectures when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.
Proposed Solution
Implement NUMA-aware expert allocation through one or more of these approaches:
- **Process-Level Binding**
  - Integrate `numactl`-like functionality directly into llama.cpp
  - Allow specifying NUMA nodes per expert group via CLI/config
- **Thread Affinity Control**
  - Add pthread/OpenMP affinity binding for expert computation threads
  - Example: `--numa-expert-map "0-7:0,8-15:1"` (experts 0-7 on NUMA0, 8-15 on NUMA1)
- **NUMA-Aware Memory Allocation**
  - Leverage `libnuma` for expert weight allocations
  - Implement an `mmap` strategy with `MAP_FIXED_NOREPLACE` for specific nodes (a minimal sketch combining affinity binding and `libnuma` allocation follows this list)
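Below is a minimal sketch of what the affinity and `libnuma` points could look like, assuming Linux, libnuma v2, and pthreads. The helper names (`alloc_expert_weights`, `bind_thread_to_node`) and the expert-to-node assignment are purely illustrative and are not part of llama.cpp's actual code or API.

```cpp
// Minimal sketch (not llama.cpp's actual API): place one expert's weights on a
// chosen NUMA node with libnuma and pin a worker thread to that node's CPUs.
// Build with: g++ -O2 numa_sketch.cpp -lnuma -lpthread
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Allocate one expert's weight buffer on a specific NUMA node.
static float *alloc_expert_weights(size_t n_floats, int node) {
    void *buf = numa_alloc_onnode(n_floats * sizeof(float), node);
    if (!buf) { perror("numa_alloc_onnode"); exit(1); }
    memset(buf, 0, n_floats * sizeof(float));  // touch the pages so they are placed now
    return static_cast<float *>(buf);
}

// Pin the calling thread to all CPUs belonging to one NUMA node.
static void bind_thread_to_node(int node) {
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(node, cpus);
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; ++i)
        if (numa_bitmask_isbitset(cpus, i)) CPU_SET(i, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    numa_free_cpumask(cpus);
}

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    const int n_experts = 8;
    const int n_nodes   = numa_num_configured_nodes();
    for (int e = 0; e < n_experts; ++e) {
        int node = e * n_nodes / n_experts;          // e.g. experts 0-3 -> node 0, 4-7 -> node 1 on two nodes
        float *w = alloc_expert_weights(1u << 20, node);
        printf("expert %d -> node %d (%p)\n", e, node, (void *)w);
        numa_free(w, (1u << 20) * sizeof(float));
    }
    bind_thread_to_node(0);                          // a worker for node-0 experts would pin itself like this
    return 0;
}
```

In a real integration, a node's worker threads would presumably bind themselves first and then perform the allocation or first touch, so the pages land on the same node where the compute happens.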
Performance Considerations
- Cross-NUMA communication cost vs. compute density tradeoff
- Automatic topology detection vs. manual mapping (see the topology-query sketch after this list)
- Support for hybrid CPU+accelerator configurations
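For the automatic-detection side, a default expert-to-node map could be derived from what libnuma reports at startup. A small query sketch, assuming libnuma v2 (the output format is made up):

```cpp
// Sketch: query NUMA topology (node count, CPUs per node, inter-node distances).
// Build with: g++ -O2 topo_sketch.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    int n_nodes = numa_num_configured_nodes();
    printf("NUMA nodes: %d\n", n_nodes);
    for (int a = 0; a < n_nodes; ++a) {
        struct bitmask *cpus = numa_allocate_cpumask();
        numa_node_to_cpus(a, cpus);
        int n_cpus = 0;
        for (unsigned i = 0; i < cpus->size; ++i)
            n_cpus += numa_bitmask_isbitset(cpus, i) ? 1 : 0;
        numa_free_cpumask(cpus);
        printf("node %d: %d CPUs, distances:", a, n_cpus);
        for (int b = 0; b < n_nodes; ++b)
            printf(" %d", numa_distance(a, b));  // 10 = local; larger values = more remote
        printf("\n");
    }
    return 0;
}
```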
Potential Speed Improvements for DeepSeek V3 Inference via NUMA-Aware MoE Allocation
Deploying DeepSeek V3 or similar large MoE models with NUMA-aware expert allocation could yield significant performance gains through these mechanisms:
1. Memory Access Optimization
- **Localized Memory Access**
  - Storing experts on local NUMA nodes reduces cross-node latency (remote accesses are typically 2-5× slower than local ones).
  - For CPU-bound inference scenarios, this could improve token generation speed by 10-25% (a page-binding sketch follows this section).
- **Cache Utilization**
  - Thread-core affinity binding improves L3 cache reuse: if 30% of expert computations rely on cached data, this may reduce cache miss penalties by 15-30%.
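Since llama.cpp normally maps model weights with mmap, localized access means getting the pages behind each expert onto the right node. One possible mechanism, sketched below, is libnuma's `numa_tonode_memory()` (a wrapper around `mbind()`); the anonymous mapping is only a stand-in for a slice of the real model mapping, and nothing here reflects how llama.cpp currently places pages.

```cpp
// Sketch: bind the page range backing one expert's tensor to a chosen NUMA node.
// Build with: g++ -O2 bind_sketch.cpp -lnuma
#include <numa.h>
#include <sys/mman.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    const size_t len = 64UL * 1024 * 1024;                    // pretend this is one expert's tensor
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);       // stand-in for part of the model mapping
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    const int target_node = 1 % numa_num_configured_nodes();  // e.g. the second socket, if present
    numa_tonode_memory(p, len, target_node);                   // set a bind policy for this range
    static_cast<char *>(p)[0] = 0;                             // pages faulted in from now on land on target_node
    printf("bound %zu bytes to node %d\n", len, target_node);

    munmap(p, len);
    return 0;
}
```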
2. Enhanced Compute Parallelism
- **Expert-Level Parallelism**
  - For DeepSeek V3's sparsely activated experts (e.g., 2/8 experts per token), co-locating experts on the same NUMA node reduces contention for memory bandwidth (a bucketing sketch follows this list).
  - On dual-socket EPYC servers, this could achieve 20-40% throughput gains (aligned with GSPMD experiments).
- **Load Balancing**
  - Dynamic routing (e.g., Switch Transformer) may cause uneven expert utilization; NUMA-aware scheduling can redistribute hotspot experts to idle nodes, reducing tail latency (see the rebalancing sketch below).
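As a purely hypothetical illustration of the expert-parallelism point, a scheduling layer could bucket the experts activated for a batch by the NUMA node that owns their weights, so each node's pinned worker pool only reads local memory. The expert-to-node map below mirrors the proposed, not-yet-existing `--numa-expert-map` syntax.

```cpp
// Sketch: bucket activated experts by owning NUMA node (illustrative only).
#include <cstdio>
#include <vector>

int main() {
    const int n_nodes = 2;
    // expert -> owning node, e.g. what --numa-expert-map "0-3:0,4-7:1" would describe (proposed flag)
    std::vector<int> expert_node = {0, 0, 0, 0, 1, 1, 1, 1};
    // experts the router activated for the current batch of tokens
    std::vector<int> activated   = {1, 6, 2, 7, 4};

    std::vector<std::vector<int>> per_node(n_nodes);
    for (int e : activated)
        per_node[expert_node[e]].push_back(e);     // bucket by owning node

    for (int node = 0; node < n_nodes; ++node) {
        printf("node %d computes experts:", node);
        for (int e : per_node[node]) printf(" %d", e);
        printf("\n");                              // each bucket would go to that node's pinned thread pool
    }
    return 0;
}
```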
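And for the load-balancing point, a toy rebalancing heuristic over per-expert activation counters (the counts and the 65% threshold are made up; actually moving the weights could use facilities such as the `move_pages(2)` syscall):

```cpp
// Sketch: detect a hotspot node and propose migrating its hottest expert.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int>  expert_node = {0, 0, 0, 0, 1, 1, 1, 1};            // current placement
    std::vector<long> activations = {900, 120, 80, 60, 50, 40, 30, 20};  // expert 0 is a hotspot

    const int n_nodes = 2;
    std::vector<long> node_load(n_nodes, 0);
    for (size_t e = 0; e < expert_node.size(); ++e)
        node_load[expert_node[e]] += activations[e];

    long total = std::accumulate(node_load.begin(), node_load.end(), 0L);
    for (int n = 0; n < n_nodes; ++n) {
        double share = (double)node_load[n] / (double)total;
        if (share <= 0.65) continue;               // threshold is arbitrary, for illustration
        // hottest expert on the overloaded node, coldest node as migration target
        int hottest = -1;
        for (size_t e = 0; e < expert_node.size(); ++e)
            if (expert_node[e] == n && (hottest < 0 || activations[e] > activations[hottest]))
                hottest = (int)e;
        int target = (int)(std::min_element(node_load.begin(), node_load.end()) - node_load.begin());
        printf("node %d holds %.0f%% of activations; consider migrating expert %d -> node %d\n",
               n, 100.0 * share, hottest, target);
    }
    return 0;
}
```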
3. Reduced Communication Overhead
- **Inter-Layer Data Transfer**
  - All-to-All communication between MoE layers may consume 15% of inference time in cross-NUMA setups; prioritizing intra-node routing could cut cross-NUMA traffic by 30-50% (per TensorFlow NUMA guidance). A toy dispatch count follows below.
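As a toy illustration of the quantity being minimized (all numbers made up): count how many token-to-expert dispatches cross a NUMA node boundary under a given placement; intra-node-first routing or placement changes would try to drive this count down.

```cpp
// Sketch: count cross-node token->expert dispatches for a given expert placement.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    std::vector<int> expert_node = {0, 0, 0, 0, 1, 1, 1, 1};  // expert -> owning node
    // (token_node, expert): node holding each token's activations and the expert it was routed to
    std::vector<std::pair<int,int>> dispatches = {{0,1},{0,5},{1,6},{1,2},{0,0},{1,7}};

    int cross = 0;
    for (const auto &d : dispatches)
        if (expert_node[d.second] != d.first) cross++;        // activation must hop to another node
    printf("cross-node dispatches: %d of %zu\n", cross, dispatches.size());
    return 0;
}
```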
If possible, it might be better to continue with #9086 to support tensor parallelism. Each NUMA node could then be treated as a standalone compute unit and sped up via column- and row-wise tensor parallelism, so that each node runs with only limited cross-node memory access.
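A toy numeric sketch of the column-/row-wise split this refers to (Megatron-style tensor parallelism), with each partition standing in for one NUMA node; in a real implementation each shard would be allocated on its node and computed by threads pinned there, while here the two "nodes" run sequentially just to show the math:

```cpp
// Sketch: column-wise vs. row-wise sharding of y = x * W across two "nodes".
#include <cstdio>
#include <vector>

// y = x * W, W is d_in x d_out (row-major)
static std::vector<float> matvec(const std::vector<float> &x,
                                 const std::vector<float> &W,
                                 int d_in, int d_out) {
    std::vector<float> y(d_out, 0.0f);
    for (int i = 0; i < d_in; ++i)
        for (int j = 0; j < d_out; ++j)
            y[j] += x[i] * W[i * d_out + j];
    return y;
}

int main() {
    const int d_in = 4, d_out = 4, n_nodes = 2;
    std::vector<float> x = {1, 2, 3, 4};
    std::vector<float> W(d_in * d_out);
    for (int i = 0; i < d_in * d_out; ++i) W[i] = (float)(i % 5);

    // Column-wise: each node owns d_out/n_nodes output columns; results are concatenated.
    std::vector<float> y_col(d_out);
    for (int node = 0; node < n_nodes; ++node) {
        int cols = d_out / n_nodes, c0 = node * cols;
        std::vector<float> Wn(d_in * cols);
        for (int i = 0; i < d_in; ++i)
            for (int j = 0; j < cols; ++j)
                Wn[i * cols + j] = W[i * d_out + c0 + j];      // this shard would live on `node`
        std::vector<float> yn = matvec(x, Wn, d_in, cols);
        for (int j = 0; j < cols; ++j) y_col[c0 + j] = yn[j];  // concat, no reduction needed
    }

    // Row-wise: each node owns d_in/n_nodes input rows; partial outputs are summed.
    std::vector<float> y_row(d_out, 0.0f);
    for (int node = 0; node < n_nodes; ++node) {
        int rows = d_in / n_nodes, r0 = node * rows;
        std::vector<float> xn(x.begin() + r0, x.begin() + r0 + rows);
        std::vector<float> Wn(W.begin() + r0 * d_out, W.begin() + (r0 + rows) * d_out);
        std::vector<float> yn = matvec(xn, Wn, rows, d_out);
        for (int j = 0; j < d_out; ++j) y_row[j] += yn[j];     // sum-reduce across nodes
    }

    for (int j = 0; j < d_out; ++j)
        printf("y[%d]: col=%g row=%g\n", j, y_col[j], y_row[j]);
    return 0;
}
```

Column-wise shards need no reduction (outputs are concatenated), while row-wise shards need a sum across nodes, which is where cross-NUMA traffic would reappear; how the two are interleaved is presumably what the tensor-parallel work in #9086 would have to manage.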
Referenced this in a benchmarking example for a 6-NUMA-node, dual-socket Intel Xeon 6980P system:
https://github.com/ggml-org/llama.cpp/discussions/12088
Bumping this, as large MoE models like DeepSeek V3 and Llama 4 Maverick continue to gain popularity. Dual-socket servers with large amounts of memory are readily available second hand at reasonable prices, and are generally the most economical way to run such large models. My dual-socket Xeon 4216 server has 230 GB/s of memory bandwidth (115 GB/s per socket), but it only reaches 4.3 tok/s of token generation on Llama 4 Maverick IQ3_XL in llama.cpp because of NUMA bottlenecks. Just allocating MoE experts in a NUMA-aware manner should give a big speedup, even without implementing tensor parallelism.
@sultanqasim
The closest thing I've seen to allocating Explicit Huge Pages on multiple NUMA nodes for "data parallel" use (e.g., 2× full copies of the weights) is this discussion and fork you could try out yourself: https://github.com/vproxy-tools/llama.cpp/
There is also some discussion more specific to thread allocation here: https://github.com/ggml-org/llama.cpp/pull/12488, though you can do some of that directly with taskset on the above code.
Keep us posted with how much faster it is on that thread!
This issue was closed because it has been inactive for 14 days since being marked as stale.