mini-sglang
[Feature] Add MLA configuration and KV cache storage kernel
Summary
This PR lays the groundwork for Multi-Head Latent Attention (MLA) support by implementing the necessary configuration parsing and KV cache storage infrastructure. It introduces a specialized CUDA kernel to handle the unique storage requirements of MLA (compressed latent vectors + RoPE parts) and includes comprehensive tests to verify correctness and performance.
Key Changes
- **Configuration (`models/config.py`):**
  - Added `MLAConfig` dataclass to hold MLA-specific parameters.
  - Updated `ModelConfig` to detect MLA models (via `kv_lora_rank`) and correctly parse parameters like `qk_rope_head_dim`, `qk_nope_head_dim`, and `v_head_dim`.
  - Adjusted `RotaryConfig` initialization to use `qk_rope_head_dim` when MLA is active.
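For concreteness, here is a minimal sketch of what the new dataclass might look like, assuming it simply groups the dimensions listed above (the actual definition in `models/config.py` may carry additional fields or defaults):

```python
from dataclasses import dataclass

@dataclass
class MLAConfig:
    # Rank of the compressed latent KV projection; its presence in the model's
    # HuggingFace config is what marks the model as using MLA.
    kv_lora_rank: int
    # Per-head dimension of the RoPE part of the query/key vectors.
    qk_rope_head_dim: int
    # Per-head dimension of the non-RoPE ("nope") part of the query/key vectors.
    qk_nope_head_dim: int
    # Per-head dimension of the value vectors.
    v_head_dim: int
```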
- **Kernel (`kernel/csrc/jit/store.cu`):**
  - Implemented `store_mla_cache_kernel` to store the compressed latent vector (`kv_c`) and the RoPE part (`k_rope`) into a single contiguous `kv_buffer`.
  - Added `StoreMLAKernel` struct with JIT compilation support using `minisgl`'s existing infrastructure.
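The layout the kernel produces can be described with a small PyTorch reference. This is an illustration of the storage pattern, not the CUDA code itself; argument names other than `kv_c`, `k_rope`, and `kv_buffer` are assumptions, and the real buffer may carry an extra head dimension:

```python
import torch

def store_mla_cache_reference(
    kv_c: torch.Tensor,         # [num_tokens, kv_lora_rank] compressed latent vectors
    k_rope: torch.Tensor,       # [num_tokens, qk_rope_head_dim] RoPE part of the keys
    kv_buffer: torch.Tensor,    # [num_slots, kv_lora_rank + qk_rope_head_dim]
    slot_indices: torch.Tensor, # [num_tokens] destination slot of each token
) -> None:
    """Each token's latent vector and RoPE key end up contiguous in one buffer row."""
    kv_lora_rank = kv_c.shape[-1]
    kv_buffer[slot_indices, :kv_lora_rank] = kv_c
    kv_buffer[slot_indices, kv_lora_rank:] = k_rope
```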
- **Python Interface (`kernel/store.py`, `kernel/__init__.py`):**
  - Exposed `store_mla_cache` to handle the JIT compilation and kernel launch for MLA cache storage.
  - Exported `store_mla_cache` in `__init__.py` to make it accessible from the package.
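Assuming the import path follows the file layout above and the argument order matches the reference sketch (both are guesses; the exact signature lives in `kernel/store.py`), calling the new entry point would look roughly like:

```python
import torch
from minisgl.kernel import store_mla_cache  # import path assumed from the file layout

# Illustrative shapes; kv_lora_rank=512 and qk_rope_head_dim=64 are DeepSeek-style values.
kv_c = torch.randn(16, 512, device="cuda", dtype=torch.bfloat16)
k_rope = torch.randn(16, 64, device="cuda", dtype=torch.bfloat16)
kv_buffer = torch.zeros(1024, 512 + 64, device="cuda", dtype=torch.bfloat16)
slots = torch.randint(0, 1024, (16,), device="cuda")

# The first call JIT-compiles the CUDA kernel; subsequent calls reuse the cached build.
store_mla_cache(kv_c, k_rope, kv_buffer, slots)
```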
- **Tests (`tests/kernel/test_store.py`):**
  - Added `test_store_mla_cache` to verify the correctness of the MLA cache layout and benchmark its performance against a PyTorch baseline.
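A rough sketch of the shape of that check and benchmark, assuming the same argument order as above (the actual test may sweep more shapes and dtypes):

```python
import torch

def check_and_time(store_mla_cache, kv_c, k_rope, kv_buffer, slots, iters=100):
    """Illustrative correctness check and timing loop; not the actual test code."""
    kv_lora_rank = kv_c.shape[-1]

    # Correctness: compare against a PyTorch indexed-assignment baseline.
    ref = torch.zeros_like(kv_buffer)
    ref[slots, :kv_lora_rank] = kv_c
    ref[slots, kv_lora_rank:] = k_rope
    store_mla_cache(kv_c, k_rope, kv_buffer, slots)
    torch.testing.assert_close(kv_buffer, ref)

    # Performance: time repeated launches with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        store_mla_cache(kv_c, k_rope, kv_buffer, slots)
    end.record()
    torch.cuda.synchronize()
    print(f"store_mla_cache: {start.elapsed_time(end) / iters:.3f} ms/iter")
```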