Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance
Feature Description
The current llama.cpp implementation does not make optimal use of NUMA architectures when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.
Proposed Solution
Implement NUMA-aware expert allocation through one or more of these approaches:
- Process-Level Binding
  - Integrate `numactl`-like functionality directly into llama.cpp (a minimal binding sketch follows this list)
  - Allow specifying NUMA nodes per expert group via CLI/config
- Thread Affinity Control
  - Add pthread/OpenMP affinity binding for expert computation threads (see the pinning sketch below)
  - Example: `--numa-expert-map "0-7:0,8-15:1"` (experts 0-7 on NUMA node 0, experts 8-15 on NUMA node 1)
- NUMA-Aware Memory Allocation
  - Leverage `libnuma` for expert weight allocations (see the allocation sketch below)
  - Implement an `mmap` strategy with `MAP_FIXED_NOREPLACE` for specific nodes
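
Process-level binding could be emulated in-process with libnuma instead of wrapping the binary in `numactl`. The following is a rough sketch only; `llama_bind_process_to_node` and the idea of a startup `--numa-node` option are assumptions for illustration, not existing llama.cpp APIs:

```cpp
// Rough sketch: bind the whole process to one NUMA node at startup,
// roughly equivalent to `numactl --cpunodebind=N --membind=N`.
// llama_bind_process_to_node() is a hypothetical helper, not an existing API.
#include <numa.h>   // libnuma; link with -lnuma

static bool llama_bind_process_to_node(int node) {
    if (numa_available() < 0) {
        return false;                       // kernel or libnuma does not support NUMA
    }
    if (numa_run_on_node(node) != 0) {
        return false;                       // restrict CPU scheduling to `node`
    }
    struct bitmask * nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, (unsigned) node);
    numa_set_membind(nodes);                // allocate memory only from `node`
    numa_free_nodemask(nodes);
    return true;
}
```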
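
For the thread-affinity approach, each expert-group worker could pin itself to the CPUs of its assigned node before first touching that group's weights, so first-touch pages land locally. A minimal sketch using libnuma and `pthread_setaffinity_np`; the `--numa-expert-map` semantics above are an assumption, not something llama.cpp implements today:

```cpp
// Rough sketch: pin the calling worker thread to the CPUs of `node`.
// bind_thread_to_numa_node() is an illustrative name, not an existing API.
#include <numa.h>       // libnuma; link with -lnuma
#include <pthread.h>
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET (glibc, _GNU_SOURCE)

static bool bind_thread_to_numa_node(int node) {
    if (numa_available() < 0) {
        return false;
    }
    struct bitmask * cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) {       // which CPUs belong to `node`
        numa_free_cpumask(cpus);
        return false;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; ++i) {
        if (numa_bitmask_isbitset(cpus, i)) {
            CPU_SET(i, &set);
        }
    }
    numa_free_cpumask(cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// With a map like "0-7:0,8-15:1", the worker handling experts 8-15 would call
// bind_thread_to_numa_node(1) before its first pass over those weights.
```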
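
For NUMA-aware allocation, an expert's weight buffer could be placed on a chosen node either at allocation time via `numa_alloc_onnode`, or after the fact by binding an already-mapped, page-aligned region with `mbind`. A sketch under those assumptions; the helper names are illustrative, not existing llama.cpp functions:

```cpp
// Rough sketch: place one expert's weight buffer on a chosen NUMA node.
// llama_expert_alloc()/bind_region_to_node() are illustrative names only.
#include <numa.h>       // numa_alloc_onnode, numa_free; link with -lnuma
#include <numaif.h>     // mbind, MPOL_BIND, MPOL_MF_MOVE
#include <cstddef>

// Path 1: allocate fresh, node-bound memory for an expert's weights.
static void * llama_expert_alloc(size_t size, int node) {
    if (numa_available() < 0) {
        return nullptr;                     // caller falls back to its normal allocator
    }
    return numa_alloc_onnode(size, node);   // page-granular, bound to `node`
}

static void llama_expert_free(void * ptr, size_t size) {
    if (ptr) {
        numa_free(ptr, size);
    }
}

// Path 2: bind an already-mmap'ed, page-aligned slice (e.g. one expert's
// region of the model file mapping) to `node`, moving resident pages if possible.
static bool bind_region_to_node(void * addr, size_t len, int node) {
    unsigned long nodemask = 1UL << node;                 // assumes node < 64
    return mbind(addr, len, MPOL_BIND, &nodemask,
                 sizeof(nodemask) * 8, MPOL_MF_MOVE) == 0;
}
```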
Performance Considerations
- Cross-NUMA communication cost vs. compute density tradeoff
- Automatic topology detection vs. manual mapping (a detection sketch follows this list)
- Support for hybrid CPU+accelerator configurations
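
For the automatic-detection side of that tradeoff, libnuma can report the configured node count, which is enough to derive a default expert-to-node assignment whenever no manual map is given. A sketch, with `map_experts_to_nodes` as an illustrative name:

```cpp
// Rough sketch: derive a default expert->node assignment from the detected
// topology when no manual map is supplied. map_experts_to_nodes() is an
// illustrative name, not an existing llama.cpp function.
#include <numa.h>       // link with -lnuma
#include <vector>

static std::vector<int> map_experts_to_nodes(int n_expert) {
    std::vector<int> expert_node(n_expert, 0);            // default: everything on node 0
    if (numa_available() < 0) {
        return expert_node;
    }
    const int n_nodes = numa_num_configured_nodes();      // e.g. 2 on a dual-socket host
    for (int e = 0; e < n_expert; ++e) {
        // split experts into contiguous blocks, one block per node,
        // so each group's weights and threads can stay on a single node
        expert_node[e] = (e * n_nodes) / n_expert;
    }
    return expert_node;
}
```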