llama.cpp
Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance
Feature Description
The current llama.cpp implementation does not make optimal use of NUMA architectures when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.
Proposed Solution
Implement NUMA-aware expert allocation through one or more of these approaches:
- **Process-Level Binding**
  - Integrate `numactl`-like functionality directly into llama.cpp
  - Allow specifying NUMA nodes per expert group via CLI/config
- **Thread Affinity Control**
  - Add pthread/OpenMP affinity binding for expert computation threads
  - Example: `--numa-expert-map "0-7:0,8-15:1"` (experts 0-7 on NUMA0, 8-15 on NUMA1)
- **NUMA-Aware Memory Allocation**
  - Leverage `libnuma` for expert weight allocations
  - Implement an `mmap` strategy with `MAP_FIXED_NOREPLACE` for specific nodes (a minimal sketch combining affinity binding and `libnuma` allocation follows this list)
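Below is a minimal sketch of what the affinity and `libnuma` points could look like, assuming Linux, libnuma v2, and pthreads. The helper names (`alloc_expert_weights`, `bind_thread_to_node`) and the expert-to-node assignment are purely illustrative and are not part of llama.cpp's actual code or API.

```cpp
// Minimal sketch (not llama.cpp's actual API): place one expert's weights on a
// chosen NUMA node with libnuma and pin a worker thread to that node's CPUs.
// Build with: g++ -O2 numa_sketch.cpp -lnuma -lpthread
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Allocate one expert's weight buffer on a specific NUMA node.
static float *alloc_expert_weights(size_t n_floats, int node) {
    void *buf = numa_alloc_onnode(n_floats * sizeof(float), node);
    if (!buf) { perror("numa_alloc_onnode"); exit(1); }
    memset(buf, 0, n_floats * sizeof(float));  // touch the pages so they are placed now
    return static_cast<float *>(buf);
}

// Pin the calling thread to all CPUs belonging to one NUMA node.
static void bind_thread_to_node(int node) {
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(node, cpus);
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; ++i)
        if (numa_bitmask_isbitset(cpus, i)) CPU_SET(i, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    numa_free_cpumask(cpus);
}

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    const int n_experts = 8;
    const int n_nodes   = numa_num_configured_nodes();
    for (int e = 0; e < n_experts; ++e) {
        int node = e * n_nodes / n_experts;          // e.g. experts 0-3 -> node 0, 4-7 -> node 1 on two nodes
        float *w = alloc_expert_weights(1u << 20, node);
        printf("expert %d -> node %d (%p)\n", e, node, (void *)w);
        numa_free(w, (1u << 20) * sizeof(float));
    }
    bind_thread_to_node(0);                          // a worker for node-0 experts would pin itself like this
    return 0;
}
```

In a real integration, a node's worker threads would presumably bind themselves first and then perform the allocation or first touch, so the pages land on the same node where the compute happens.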
Performance Considerations
- Cross-NUMA communication cost vs. compute density tradeoff
- Automatic topology detection vs. manual mapping (see the topology-query sketch after this list)
- Support for hybrid CPU+accelerator configurations
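For the automatic-detection side, a default expert-to-node map could be derived from what libnuma reports at startup. A small query sketch, assuming libnuma v2 (the output format is made up):

```cpp
// Sketch: query NUMA topology (node count, CPUs per node, inter-node distances).
// Build with: g++ -O2 topo_sketch.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    int n_nodes = numa_num_configured_nodes();
    printf("NUMA nodes: %d\n", n_nodes);
    for (int a = 0; a < n_nodes; ++a) {
        struct bitmask *cpus = numa_allocate_cpumask();
        numa_node_to_cpus(a, cpus);
        int n_cpus = 0;
        for (unsigned i = 0; i < cpus->size; ++i)
            n_cpus += numa_bitmask_isbitset(cpus, i) ? 1 : 0;
        numa_free_cpumask(cpus);
        printf("node %d: %d CPUs, distances:", a, n_cpus);
        for (int b = 0; b < n_nodes; ++b)
            printf(" %d", numa_distance(a, b));  // 10 = local; larger values = more remote
        printf("\n");
    }
    return 0;
}
```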
Potential Speed Improvements for DeepSeek V3 Inference via NUMA-Aware MoE Allocation
Deploying DeepSeek V3 or similar large MoE models with NUMA-aware expert allocation could yield significant performance gains through these mechanisms:
1. Memory Access Optimization
- **Localized Memory Access**
  - Storing experts on local NUMA nodes reduces cross-node latency (remote accesses are typically 2-5× slower than local ones).
  - For CPU-bound inference scenarios, this could improve token generation speed by 10-25% (a page-binding sketch follows this section).
- **Cache Utilization**
  - Thread-core affinity binding improves L3 cache reuse: if 30% of expert computations rely on cached data, this may reduce cache miss penalties by 15-30%.
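Since llama.cpp normally maps model weights with mmap, localized access means getting the pages behind each expert onto the right node. One possible mechanism, sketched below, is libnuma's `numa_tonode_memory()` (a wrapper around `mbind()`); the anonymous mapping is only a stand-in for a slice of the real model mapping, and nothing here reflects how llama.cpp currently places pages.

```cpp
// Sketch: bind the page range backing one expert's tensor to a chosen NUMA node.
// Build with: g++ -O2 bind_sketch.cpp -lnuma
#include <numa.h>
#include <sys/mman.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    const size_t len = 64UL * 1024 * 1024;                    // pretend this is one expert's tensor
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);       // stand-in for part of the model mapping
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    const int target_node = 1 % numa_num_configured_nodes();  // e.g. the second socket, if present
    numa_tonode_memory(p, len, target_node);                   // set a bind policy for this range
    static_cast<char *>(p)[0] = 0;                             // pages faulted in from now on land on target_node
    printf("bound %zu bytes to node %d\n", len, target_node);

    munmap(p, len);
    return 0;
}
```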
2. Enhanced Compute Parallelism
- **Expert-Level Parallelism**
  - For DeepSeek V3's sparsely activated experts (e.g., 2/8 experts per token), co-locating experts on the same NUMA node reduces contention for memory bandwidth (a bucketing sketch follows this list).
  - On dual-socket EPYC servers, this could achieve 20-40% throughput gains (aligned with GSPMD experiments).
- **Load Balancing**
  - Dynamic routing (e.g., Switch Transformer) may cause uneven expert utilization; NUMA-aware scheduling can redistribute hotspot experts to idle nodes, reducing tail latency (see the rebalancing sketch below).
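As a purely hypothetical illustration of the expert-parallelism point, a scheduling layer could bucket the experts activated for a batch by the NUMA node that owns their weights, so each node's pinned worker pool only reads local memory. The expert-to-node map below mirrors the proposed, not-yet-existing `--numa-expert-map` syntax.

```cpp
// Sketch: bucket activated experts by owning NUMA node (illustrative only).
#include <cstdio>
#include <vector>

int main() {
    const int n_nodes = 2;
    // expert -> owning node, e.g. what --numa-expert-map "0-3:0,4-7:1" would describe (proposed flag)
    std::vector<int> expert_node = {0, 0, 0, 0, 1, 1, 1, 1};
    // experts the router activated for the current batch of tokens
    std::vector<int> activated   = {1, 6, 2, 7, 4};

    std::vector<std::vector<int>> per_node(n_nodes);
    for (int e : activated)
        per_node[expert_node[e]].push_back(e);     // bucket by owning node

    for (int node = 0; node < n_nodes; ++node) {
        printf("node %d computes experts:", node);
        for (int e : per_node[node]) printf(" %d", e);
        printf("\n");                              // each bucket would go to that node's pinned thread pool
    }
    return 0;
}
```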
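And for the load-balancing point, a toy rebalancing heuristic over per-expert activation counters (the counts and the 65% threshold are made up; actually moving the weights could use facilities such as the `move_pages(2)` syscall):

```cpp
// Sketch: detect a hotspot node and propose migrating its hottest expert.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int>  expert_node = {0, 0, 0, 0, 1, 1, 1, 1};            // current placement
    std::vector<long> activations = {900, 120, 80, 60, 50, 40, 30, 20};  // expert 0 is a hotspot

    const int n_nodes = 2;
    std::vector<long> node_load(n_nodes, 0);
    for (size_t e = 0; e < expert_node.size(); ++e)
        node_load[expert_node[e]] += activations[e];

    long total = std::accumulate(node_load.begin(), node_load.end(), 0L);
    for (int n = 0; n < n_nodes; ++n) {
        double share = (double)node_load[n] / (double)total;
        if (share <= 0.65) continue;               // threshold is arbitrary, for illustration
        // hottest expert on the overloaded node, coldest node as migration target
        int hottest = -1;
        for (size_t e = 0; e < expert_node.size(); ++e)
            if (expert_node[e] == n && (hottest < 0 || activations[e] > activations[hottest]))
                hottest = (int)e;
        int target = (int)(std::min_element(node_load.begin(), node_load.end()) - node_load.begin());
        printf("node %d holds %.0f%% of activations; consider migrating expert %d -> node %d\n",
               n, 100.0 * share, hottest, target);
    }
    return 0;
}
```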
3. Reduced Communication Overhead
- **Inter-Layer Data Transfer**
  - All-to-All communication between MoE layers may consume 15% of inference time in cross-NUMA setups; prioritizing intra-node routing could cut cross-NUMA traffic by 30-50% (per TensorFlow NUMA guidance). A toy dispatch count follows below.
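As a toy illustration of the quantity being minimized (all numbers made up): count how many token-to-expert dispatches cross a NUMA node boundary under a given placement; intra-node-first routing or placement changes would try to drive this count down.

```cpp
// Sketch: count cross-node token->expert dispatches for a given expert placement.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    std::vector<int> expert_node = {0, 0, 0, 0, 1, 1, 1, 1};  // expert -> owning node
    // (token_node, expert): node holding each token's activations and the expert it was routed to
    std::vector<std::pair<int,int>> dispatches = {{0,1},{0,5},{1,6},{1,2},{0,0},{1,7}};

    int cross = 0;
    for (const auto &d : dispatches)
        if (expert_node[d.second] != d.first) cross++;        // activation must hop to another node
    printf("cross-node dispatches: %d of %zu\n", cross, dispatches.size());
    return 0;
}
```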
If possible, it might be better to continue with #9086 to support tensor parallelism. Each NUMA node could then be treated as a standalone compute unit and sped up via column- and row-wise tensor parallelism, so that each node runs with only limited cross-node memory access.
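A toy numeric sketch of the column-/row-wise split this refers to (Megatron-style tensor parallelism), with each partition standing in for one NUMA node; in a real implementation each shard would be allocated on its node and computed by threads pinned there, while here the two "nodes" run sequentially just to show the math:

```cpp
// Sketch: column-wise vs. row-wise sharding of y = x * W across two "nodes".
#include <cstdio>
#include <vector>

// y = x * W, W is d_in x d_out (row-major)
static std::vector<float> matvec(const std::vector<float> &x,
                                 const std::vector<float> &W,
                                 int d_in, int d_out) {
    std::vector<float> y(d_out, 0.0f);
    for (int i = 0; i < d_in; ++i)
        for (int j = 0; j < d_out; ++j)
            y[j] += x[i] * W[i * d_out + j];
    return y;
}

int main() {
    const int d_in = 4, d_out = 4, n_nodes = 2;
    std::vector<float> x = {1, 2, 3, 4};
    std::vector<float> W(d_in * d_out);
    for (int i = 0; i < d_in * d_out; ++i) W[i] = (float)(i % 5);

    // Column-wise: each node owns d_out/n_nodes output columns; results are concatenated.
    std::vector<float> y_col(d_out);
    for (int node = 0; node < n_nodes; ++node) {
        int cols = d_out / n_nodes, c0 = node * cols;
        std::vector<float> Wn(d_in * cols);
        for (int i = 0; i < d_in; ++i)
            for (int j = 0; j < cols; ++j)
                Wn[i * cols + j] = W[i * d_out + c0 + j];      // this shard would live on `node`
        std::vector<float> yn = matvec(x, Wn, d_in, cols);
        for (int j = 0; j < cols; ++j) y_col[c0 + j] = yn[j];  // concat, no reduction needed
    }

    // Row-wise: each node owns d_in/n_nodes input rows; partial outputs are summed.
    std::vector<float> y_row(d_out, 0.0f);
    for (int node = 0; node < n_nodes; ++node) {
        int rows = d_in / n_nodes, r0 = node * rows;
        std::vector<float> xn(x.begin() + r0, x.begin() + r0 + rows);
        std::vector<float> Wn(W.begin() + r0 * d_out, W.begin() + (r0 + rows) * d_out);
        std::vector<float> yn = matvec(xn, Wn, rows, d_out);
        for (int j = 0; j < d_out; ++j) y_row[j] += yn[j];     // sum-reduce across nodes
    }

    for (int j = 0; j < d_out; ++j)
        printf("y[%d]: col=%g row=%g\n", j, y_col[j], y_row[j]);
    return 0;
}
```

Column-wise shards need no reduction (outputs are concatenated), while row-wise shards need a sum across nodes, which is where cross-NUMA traffic would reappear; how the two are interleaved is presumably what the tensor-parallel work in #9086 would have to manage.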
Referenced this in a benchmarking example for a 6-NUMA-node, dual-socket Intel Xeon 6980P system:
https://github.com/ggml-org/llama.cpp/discussions/12088
Bumping this, as large MoE models like DeepSeek V3 and Llama 4 Maverick continue to gain popularity. Dual-socket servers with large amounts of memory are readily available second hand at reasonable prices, and are generally the most economical way to run such large models. My dual-socket Xeon 4216 server has 230 GB/s of memory bandwidth (115 GB/s per socket), but it only reaches 4.3 tok/s of token generation on Llama 4 Maverick IQ3_XL in llama.cpp because of NUMA bottlenecks. Just allocating MoE experts in a NUMA-aware manner should give a big speedup, even without implementing tensor parallelism.
@sultanqasim
The closest thing I've seen to allocating Explicit Huge Pages on multiple NUMA nodes for "data parallel" use (e.g., 2× full copies of the weights) is this discussion and fork you could try out yourself: https://github.com/vproxy-tools/llama.cpp/
There is also some discussion more specific to thread allocation here: https://github.com/ggml-org/llama.cpp/pull/12488, though you can do some of that directly with taskset on the above code.
Keep us posted with how much faster it is on that thread!
This issue was closed because it has been inactive for 14 days since being marked as stale.