[KVCache] Per Layer Sliding Window
Adds per-layer sliding window functionality to the KV cache. Correctness is mostly achieved, but in some cases individual tokens are still incorrect. The corresponding MLC-LLM PR is https://github.com/mlc-ai/mlc-llm/pull/3248
A full list of changes and additions is below:
- Add a new attention type, `MHA_SLIDING`, for per-layer sliding window
- Add corresponding vectors for per-layer sliding window offset calculations
- For KV caches with per-layer sliding window attention enabled, the regular (cache-wide) sliding window is disabled to prevent page eviction
- Gemma3 uses different RoPE parameters for its local sliding-window layers. These should be passed as parameters to the KVCache, but the values are currently hardcoded
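To illustrate the offset calculation mentioned above, here is a minimal Python sketch of how a per-layer sliding window restricts attention. The function names (`sliding_window_start`, `attention_mask`) are hypothetical and not part of the TVM KV cache API; the sketch only shows the windowing arithmetic: a query at position `q` may attend to keys in `[max(0, q - W + 1), q]` for a layer with window size `W`, while layers with no window fall back to full causal attention.

```python
def sliding_window_start(q_pos: int, window_size: int) -> int:
    # First key position visible to a query at absolute position q_pos.
    # A non-positive window_size means full (global) causal attention.
    if window_size <= 0:
        return 0
    return max(0, q_pos - window_size + 1)


def attention_mask(seq_len: int, window_size: int) -> list[list[bool]]:
    # Causal mask where each query row is additionally restricted to the
    # last `window_size` key positions (per-layer sliding window).
    return [
        [sliding_window_start(q, window_size) <= k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

For example, with `window_size=3` a query at position 4 attends only to positions 2, 3, and 4, whereas a layer with `window_size=0` attends to all prior positions as usual.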