[KVCache] Per Layer Sliding Window
Adds per-layer sliding window functionality to the KV cache. Correctness is mostly achieved, but in some cases individual tokens are still incorrect. The corresponding MLC-LLM PR is https://github.com/mlc-ai/mlc-llm/pull/3248
A full list of changes and additions is below:
- Add a new attention type, `MHA_SLIDING`, for per-layer sliding window
- Add corresponding vectors for per-layer sliding window offset calculations
- For KV caches with per-layer sliding window attention enabled, the regular (cache-wide) sliding window is disabled to prevent page eviction
- Gemma3 uses different RoPE parameters for its local sliding-window layers. These should be passed as parameters to the KVCache, but the values are currently hardcoded
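To illustrate the offset calculation mentioned above, here is a minimal Python sketch of how a per-layer sliding window restricts attention. The function names (`sliding_window_start`, `attention_mask`) are hypothetical and not part of the TVM KV cache API; the sketch only shows the windowing arithmetic: a query at position `q` may attend to keys in `[max(0, q - W + 1), q]` for a layer with window size `W`, while layers with no window fall back to full causal attention.

```python
def sliding_window_start(q_pos: int, window_size: int) -> int:
    # First key position visible to a query at absolute position q_pos.
    # A non-positive window_size means full (global) causal attention.
    if window_size <= 0:
        return 0
    return max(0, q_pos - window_size + 1)


def attention_mask(seq_len: int, window_size: int) -> list[list[bool]]:
    # Causal mask where each query row is additionally restricted to the
    # last `window_size` key positions (per-layer sliding window).
    return [
        [sliding_window_start(q, window_size) <= k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

For example, with `window_size=3` a query at position 4 attends only to positions 2, 3, and 4, whereas a layer with `window_size=0` attends to all prior positions as usual.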