
Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance

Open · l15y opened this issue 1 month ago · 1 comment

Feature Description

The current llama.cpp implementation does not make optimal use of NUMA topology when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.

Proposed Solution

Implement NUMA-aware expert allocation through one or more of these approaches:

  1. Process-Level Binding

    • Integrate numactl-like functionality directly into llama.cpp
    • Allow specifying NUMA nodes per expert group via CLI/config
  2. Thread Affinity Control

    • Add pthread/OpenMP affinity binding for expert computation threads
    • Example: --numa-expert-map "0-7:0,8-15:1" (experts 0-7 on NUMA node 0, experts 8-15 on node 1); see the sketch after this list
  3. NUMA-Aware Memory Allocation

    • Leverage libnuma for expert weight allocations
    • Combine an mmap-based strategy (e.g. MAP_FIXED_NOREPLACE to reserve address ranges) with mbind() so the pages for each expert land on specific nodes
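
A minimal sketch of what (2) and (3) could look like, assuming libnuma and pthreads are available. The helper names (parse_expert_map, bind_thread_to_node), the hard-coded buffer size, and the map string are purely illustrative and are not existing llama.cpp APIs or flags:

```cpp
// Sketch: parse an "experts:node" map like "0-7:0,8-15:1", pin the calling
// thread to one node's CPUs, and allocate an expert's weight buffer on that
// node with libnuma. Build with: g++ -O2 sketch.cpp -lnuma -lpthread
#include <numa.h>        // numa_alloc_onnode, numa_node_to_cpus, ...
#include <pthread.h>     // pthread_setaffinity_np
#include <sched.h>       // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <cstring>
#include <map>
#include <string>

// expert id -> NUMA node, parsed from a spec like "0-7:0,8-15:1"
static std::map<int, int> parse_expert_map(const std::string & spec) {
    std::map<int, int> out;
    size_t pos = 0;
    while (pos < spec.size()) {
        size_t end = spec.find(',', pos);
        if (end == std::string::npos) end = spec.size();
        int lo, hi, node;
        // each entry looks like "lo-hi:node"
        if (sscanf(spec.substr(pos, end - pos).c_str(), "%d-%d:%d", &lo, &hi, &node) == 3) {
            for (int e = lo; e <= hi; ++e) out[e] = node;
        }
        pos = end + 1;
    }
    return out;
}

// pin the calling thread to all CPUs belonging to the given NUMA node
static bool bind_thread_to_node(int node) {
    struct bitmask * cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) { numa_free_cpumask(cpus); return false; }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; ++i) {
        if (numa_bitmask_isbitset(cpus, i)) CPU_SET(i, &set);
    }
    numa_free_cpumask(cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    std::map<int, int> expert_node = parse_expert_map("0-7:0,8-15:1");

    // place (dummy) weights for expert 9 on its mapped node, pin this thread there
    int    node  = expert_node[9];
    size_t bytes = 64u * 1024 * 1024;
    void * weights = numa_alloc_onnode(bytes, node); // pages bound to `node`, faulted in on touch
    if (!weights) { perror("numa_alloc_onnode"); return 1; }
    memset(weights, 0, bytes);                       // actually fault the pages in

    bind_thread_to_node(node);
    printf("expert 9 -> node %d, %zu MiB allocated node-local\n", node, bytes >> 20);

    numa_free(weights, bytes);
    return 0;
}
```

For comparison, the closest workaround today is binding the whole process with numactl --cpunodebind=N --membind=N, which cannot distinguish between expert groups; the per-expert mapping sketched above is what this request would add.
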

Performance Considerations

  • Cross-NUMA communication cost vs. compute density tradeoff
  • Automatic topology detection vs. manual mapping (a default-mapping sketch follows below)
  • Support for hybrid CPU+accelerator configurations
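
On the automatic-detection side, a first cut could simply derive a contiguous block mapping from the number of configured nodes whenever no explicit map is given. This is a hedged sketch; n_expert and default_expert_map are illustrative names, not existing llama.cpp symbols:

```cpp
#include <numa.h>     // numa_available, numa_num_configured_nodes
#include <vector>

// Fallback when no --numa-expert-map is given: split the experts into
// contiguous blocks, one block per configured NUMA node. For 16 experts
// on 2 nodes this reproduces the "0-7:0,8-15:1" split from the example above.
static std::vector<int> default_expert_map(int n_expert) {
    int n_nodes = (numa_available() < 0) ? 1 : numa_num_configured_nodes();
    if (n_nodes < 1) n_nodes = 1;
    std::vector<int> node_of_expert(n_expert);
    for (int e = 0; e < n_expert; ++e) {
        node_of_expert[e] = (int)((long) e * n_nodes / n_expert);
    }
    return node_of_expert;
}
```
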
