
[Feature] Add MLA configuration and KV cache storage kernel

Open DhiraPT opened this issue 3 days ago • 0 comments

Summary

This PR lays the groundwork for Multi-Head Latent Attention (MLA) support by implementing the necessary configuration parsing and KV cache storage infrastructure. It introduces a specialized CUDA kernel for MLA's storage layout (compressed latent vectors plus RoPE parts) and includes a test that verifies correctness and benchmarks performance against a PyTorch baseline.

Key Changes

  • Configuration (models/config.py):
    • Added an MLAConfig dataclass to hold MLA-specific parameters (a hedged sketch follows the list below).
    • Updated ModelConfig to detect MLA models (via the presence of kv_lora_rank) and correctly parse parameters such as qk_rope_head_dim, qk_nope_head_dim, and v_head_dim.
    • Adjusted RotaryConfig initialization to use qk_rope_head_dim when MLA is active.
  • Kernel (kernel/csrc/jit/store.cu):
    • Implemented store_mla_cache_kernel to store the compressed latent vector (kv_c) and the RoPE part (k_rope) into a single contiguous kv_buffer (the layout is illustrated in a sketch after this list).
    • Added a StoreMLAKernel struct with JIT compilation support using minisgl's existing infrastructure.
  • Python Interface (kernel/store.py, kernel/__init__.py):
    • Exposed store_mla_cache to handle JIT compilation and kernel launch for MLA cache storage.
    • Exported store_mla_cache from __init__.py so it is accessible at the package level.
  • Tests (tests/kernel/test_store.py):
    • Added test_store_mla_cache to verify the correctness of the MLA cache layout and benchmark the kernel against a PyTorch baseline (see the usage sketch after this list).
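
For reference, here is a minimal sketch of what the configuration side could look like. Only the parameter names listed above (kv_lora_rank, qk_rope_head_dim, qk_nope_head_dim, v_head_dim) come from this PR; the dataclass layout and the parse_mla_config helper are assumptions for illustration, not the actual code in models/config.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MLAConfig:
    """Hypothetical sketch of the MLA-specific parameters described above."""
    kv_lora_rank: int        # rank of the compressed latent KV vector (kv_c)
    qk_rope_head_dim: int    # per-head dim of the RoPE part of the key (k_rope)
    qk_nope_head_dim: int    # per-head dim of the non-RoPE query/key part
    v_head_dim: int          # per-head value dim


def parse_mla_config(hf_config: dict) -> Optional[MLAConfig]:
    """Detect an MLA model via the presence of kv_lora_rank (sketch only)."""
    if hf_config.get("kv_lora_rank") is None:
        return None  # not an MLA model; fall back to the standard KV config
    return MLAConfig(
        kv_lora_rank=hf_config["kv_lora_rank"],
        qk_rope_head_dim=hf_config["qk_rope_head_dim"],
        qk_nope_head_dim=hf_config["qk_nope_head_dim"],
        v_head_dim=hf_config["v_head_dim"],
    )
```

With a split like this, RotaryConfig can take qk_rope_head_dim as its rotary dimension instead of the full head dimension, which matches the third bullet under Configuration.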
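
The storage layout the CUDA kernel implements can be expressed in a few lines of PyTorch, which is essentially the baseline the test compares against. The names kv_c, k_rope, and kv_buffer come from the description above; the slots argument and the exact shapes are assumptions.

```python
import torch


def store_mla_cache_reference(
    kv_buffer: torch.Tensor,  # [num_slots, kv_lora_rank + qk_rope_head_dim]
    kv_c: torch.Tensor,       # [num_tokens, kv_lora_rank], compressed latent vectors
    k_rope: torch.Tensor,     # [num_tokens, qk_rope_head_dim], RoPE part of the key
    slots: torch.Tensor,      # [num_tokens], destination slot index per token
) -> None:
    """PyTorch reference for the MLA cache store (a sketch, not the CUDA kernel)."""
    kv_lora_rank = kv_c.shape[-1]
    # Each kv_buffer row holds the latent vector followed by the RoPE part,
    # so both pieces of a token's cache entry sit in one contiguous row.
    kv_buffer[slots, :kv_lora_rank] = kv_c
    kv_buffer[slots, kv_lora_rank:] = k_rope
```

Keeping both parts in one contiguous row lets an attention kernel read a token's full MLA cache entry in a single pass, which is presumably the motivation for a fused store kernel rather than two separate copies.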
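
A usage sketch in the spirit of test_store_mla_cache, assuming store_mla_cache accepts the same tensors as the reference above; the minisgl.kernel import path, the argument order, and the sizes are assumptions.

```python
import torch

from minisgl.kernel import store_mla_cache  # assumed import path and signature

num_slots, num_tokens = 1024, 16
kv_lora_rank, qk_rope_head_dim = 512, 64

kv_buffer = torch.zeros(num_slots, kv_lora_rank + qk_rope_head_dim,
                        dtype=torch.bfloat16, device="cuda")
kv_c = torch.randn(num_tokens, kv_lora_rank, dtype=torch.bfloat16, device="cuda")
k_rope = torch.randn(num_tokens, qk_rope_head_dim, dtype=torch.bfloat16, device="cuda")
slots = torch.randperm(num_slots, device="cuda")[:num_tokens]

# JIT-compiles the kernel on first use, then launches it (assumed behavior).
store_mla_cache(kv_buffer, kv_c, k_rope, slots)

# Verify against the pure-PyTorch layout from the sketch above.
expected = torch.zeros_like(kv_buffer)
expected[slots, :kv_lora_rank] = kv_c
expected[slots, kv_lora_rank:] = k_rope
torch.testing.assert_close(kv_buffer, expected)
```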

DhiraPT · Dec 23 '25 19:12