
Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance

Open · l15y opened this issue 1 month ago · 1 comment

Feature Description

The current llama.cpp implementation does not make optimal use of NUMA topology when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.

Proposed Solution

Implement NUMA-aware expert allocation through one or more of these approaches:

  1. Process-Level Binding

    • Integrate numactl-like functionality directly into llama.cpp
    • Allow specifying NUMA nodes per expert group via CLI/config
  2. Thread Affinity Control

    • Add pthread/OpenMP affinity binding for expert computation threads
    • Example: --numa-expert-map "0-7:0,8-15:1" (experts 0-7 on NUMA node 0, experts 8-15 on node 1); see the sketch after this list
  3. NUMA-Aware Memory Allocation

    • Leverage libnuma for expert weight allocations
    • Combine an mmap-based strategy (e.g. MAP_FIXED_NOREPLACE to reserve address ranges) with mbind() so the pages for each expert land on specific nodes
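
A minimal sketch of what (2) and (3) could look like, assuming libnuma and pthreads are available. The helper names (parse_expert_map, bind_thread_to_node), the hard-coded buffer size, and the map string are purely illustrative and are not existing llama.cpp APIs or flags:

```cpp
// Sketch: parse an "experts:node" map like "0-7:0,8-15:1", pin the calling
// thread to one node's CPUs, and allocate an expert's weight buffer on that
// node with libnuma. Build with: g++ -O2 sketch.cpp -lnuma -lpthread
#include <numa.h>        // numa_alloc_onnode, numa_node_to_cpus, ...
#include <pthread.h>     // pthread_setaffinity_np
#include <sched.h>       // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <cstring>
#include <map>
#include <string>

// expert id -> NUMA node, parsed from a spec like "0-7:0,8-15:1"
static std::map<int, int> parse_expert_map(const std::string & spec) {
    std::map<int, int> out;
    size_t pos = 0;
    while (pos < spec.size()) {
        size_t end = spec.find(',', pos);
        if (end == std::string::npos) end = spec.size();
        int lo, hi, node;
        // each entry looks like "lo-hi:node"
        if (sscanf(spec.substr(pos, end - pos).c_str(), "%d-%d:%d", &lo, &hi, &node) == 3) {
            for (int e = lo; e <= hi; ++e) out[e] = node;
        }
        pos = end + 1;
    }
    return out;
}

// pin the calling thread to all CPUs belonging to the given NUMA node
static bool bind_thread_to_node(int node) {
    struct bitmask * cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) { numa_free_cpumask(cpus); return false; }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < cpus->size; ++i) {
        if (numa_bitmask_isbitset(cpus, i)) CPU_SET(i, &set);
    }
    numa_free_cpumask(cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    std::map<int, int> expert_node = parse_expert_map("0-7:0,8-15:1");

    // place (dummy) weights for expert 9 on its mapped node, pin this thread there
    int    node  = expert_node[9];
    size_t bytes = 64u * 1024 * 1024;
    void * weights = numa_alloc_onnode(bytes, node); // pages bound to `node`, faulted in on touch
    if (!weights) { perror("numa_alloc_onnode"); return 1; }
    memset(weights, 0, bytes);                       // actually fault the pages in

    bind_thread_to_node(node);
    printf("expert 9 -> node %d, %zu MiB allocated node-local\n", node, bytes >> 20);

    numa_free(weights, bytes);
    return 0;
}
```

For comparison, the closest workaround today is binding the whole process with numactl --cpunodebind=N --membind=N, which cannot distinguish between expert groups; the per-expert mapping sketched above is what this request would add.
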

Performance Considerations

  • Cross-NUMA communication cost vs. compute density tradeoff
  • Automatic topology detection vs. manual mapping (a default-mapping sketch follows below)
  • Support for hybrid CPU+accelerator configurations
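
On the automatic-detection side, a first cut could simply derive a contiguous block mapping from the number of configured nodes whenever no explicit map is given. This is a hedged sketch; n_expert and default_expert_map are illustrative names, not existing llama.cpp symbols:

```cpp
#include <numa.h>     // numa_available, numa_num_configured_nodes
#include <vector>

// Fallback when no --numa-expert-map is given: split the experts into
// contiguous blocks, one block per configured NUMA node. For 16 experts
// on 2 nodes this reproduces the "0-7:0,8-15:1" split from the example above.
static std::vector<int> default_expert_map(int n_expert) {
    int n_nodes = (numa_available() < 0) ? 1 : numa_num_configured_nodes();
    if (n_nodes < 1) n_nodes = 1;
    std::vector<int> node_of_expert(n_expert);
    for (int e = 0; e < n_expert; ++e) {
        node_of_expert[e] = (int)((long) e * n_nodes / n_expert);
    }
    return node_of_expert;
}
```
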
