
[Feature] Add MLA configuration and KV cache storage kernel

Open DhiraPT opened this issue 3 days ago • 0 comments

Summary

This PR lays the groundwork for Multi-Head Latent Attention (MLA) support by implementing the necessary configuration parsing and KV cache storage infrastructure. It introduces a specialized CUDA kernel for MLA's storage layout (compressed latent vectors plus RoPE parts) and includes a test that verifies correctness and benchmarks performance against a PyTorch baseline.

Key Changes

  • Configuration (models/config.py):
    • Added an MLAConfig dataclass to hold MLA-specific parameters (a hedged sketch follows the list below).
    • Updated ModelConfig to detect MLA models (via the presence of kv_lora_rank) and correctly parse parameters such as qk_rope_head_dim, qk_nope_head_dim, and v_head_dim.
    • Adjusted RotaryConfig initialization to use qk_rope_head_dim when MLA is active.
  • Kernel (kernel/csrc/jit/store.cu):
    • Implemented store_mla_cache_kernel to store the compressed latent vector (kv_c) and the RoPE part (k_rope) into a single contiguous kv_buffer (the layout is illustrated in a sketch after this list).
    • Added a StoreMLAKernel struct with JIT compilation support using minisgl's existing infrastructure.
  • Python Interface (kernel/store.py, kernel/__init__.py):
    • Exposed store_mla_cache to handle JIT compilation and kernel launch for MLA cache storage.
    • Exported store_mla_cache from __init__.py so it is accessible at the package level.
  • Tests (tests/kernel/test_store.py):
    • Added test_store_mla_cache to verify the correctness of the MLA cache layout and benchmark the kernel against a PyTorch baseline (see the usage sketch after this list).
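
For reference, here is a minimal sketch of what the configuration side could look like. Only the parameter names listed above (kv_lora_rank, qk_rope_head_dim, qk_nope_head_dim, v_head_dim) come from this PR; the dataclass layout and the parse_mla_config helper are assumptions for illustration, not the actual code in models/config.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MLAConfig:
    """Hypothetical sketch of the MLA-specific parameters described above."""
    kv_lora_rank: int        # rank of the compressed latent KV vector (kv_c)
    qk_rope_head_dim: int    # per-head dim of the RoPE part of the key (k_rope)
    qk_nope_head_dim: int    # per-head dim of the non-RoPE query/key part
    v_head_dim: int          # per-head value dim


def parse_mla_config(hf_config: dict) -> Optional[MLAConfig]:
    """Detect an MLA model via the presence of kv_lora_rank (sketch only)."""
    if hf_config.get("kv_lora_rank") is None:
        return None  # not an MLA model; fall back to the standard KV config
    return MLAConfig(
        kv_lora_rank=hf_config["kv_lora_rank"],
        qk_rope_head_dim=hf_config["qk_rope_head_dim"],
        qk_nope_head_dim=hf_config["qk_nope_head_dim"],
        v_head_dim=hf_config["v_head_dim"],
    )
```

With a split like this, RotaryConfig can take qk_rope_head_dim as its rotary dimension instead of the full head dimension, which matches the third bullet under Configuration.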
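
The storage layout the CUDA kernel implements can be expressed in a few lines of PyTorch, which is essentially the baseline the test compares against. The names kv_c, k_rope, and kv_buffer come from the description above; the slots argument and the exact shapes are assumptions.

```python
import torch


def store_mla_cache_reference(
    kv_buffer: torch.Tensor,  # [num_slots, kv_lora_rank + qk_rope_head_dim]
    kv_c: torch.Tensor,       # [num_tokens, kv_lora_rank], compressed latent vectors
    k_rope: torch.Tensor,     # [num_tokens, qk_rope_head_dim], RoPE part of the key
    slots: torch.Tensor,      # [num_tokens], destination slot index per token
) -> None:
    """PyTorch reference for the MLA cache store (a sketch, not the CUDA kernel)."""
    kv_lora_rank = kv_c.shape[-1]
    # Each kv_buffer row holds the latent vector followed by the RoPE part,
    # so both pieces of a token's cache entry sit in one contiguous row.
    kv_buffer[slots, :kv_lora_rank] = kv_c
    kv_buffer[slots, kv_lora_rank:] = k_rope
```

Keeping both parts in one contiguous row lets an attention kernel read a token's full MLA cache entry in a single pass, which is presumably the motivation for a fused store kernel rather than two separate copies.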
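
A usage sketch in the spirit of test_store_mla_cache, assuming store_mla_cache accepts the same tensors as the reference above; the minisgl.kernel import path, the argument order, and the sizes are assumptions.

```python
import torch

from minisgl.kernel import store_mla_cache  # assumed import path and signature

num_slots, num_tokens = 1024, 16
kv_lora_rank, qk_rope_head_dim = 512, 64

kv_buffer = torch.zeros(num_slots, kv_lora_rank + qk_rope_head_dim,
                        dtype=torch.bfloat16, device="cuda")
kv_c = torch.randn(num_tokens, kv_lora_rank, dtype=torch.bfloat16, device="cuda")
k_rope = torch.randn(num_tokens, qk_rope_head_dim, dtype=torch.bfloat16, device="cuda")
slots = torch.randperm(num_slots, device="cuda")[:num_tokens]

# JIT-compiles the kernel on first use, then launches it (assumed behavior).
store_mla_cache(kv_buffer, kv_c, k_rope, slots)

# Verify against the pure-PyTorch layout from the sketch above.
expected = torch.zeros_like(kv_buffer)
expected[slots, :kv_lora_rank] = kv_c
expected[slots, kv_lora_rank:] = k_rope
torch.testing.assert_close(kv_buffer, expected)
```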

DhiraPT · Dec 23 '25 19:12