
[KVCache] Per Layer Sliding Window

Open joshua-j-hong opened this issue 6 months ago • 2 comments

Adds per-layer sliding window functionality to the KV cache. Output is mostly correct, but in some cases individual tokens still come out wrong. The corresponding MLC-LLM PR is https://github.com/mlc-ai/mlc-llm/pull/3248

The full list of changes and additions is below:

  • Add a new attention type for per-layer sliding window, called MHA_SLIDING
  • Add the corresponding vectors used for per-layer sliding window offset calculations
  • For a KV cache with per-layer sliding window attention enabled, the regular sliding window is disabled to prevent page eviction
  • Gemma3 uses different RoPE parameters for its local sliding window layers. These should be passed to the KV cache as parameters, but they are currently hardcoded
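
To make the offset idea concrete, here is a minimal Python sketch of how a per-layer window could restrict which cached positions a query attends to while all pages stay in the cache. It is an illustration under stated assumptions, not the code in this PR: only the name MHA_SLIDING comes from the list above, while `AttnKind`, `layer_kinds`, `visible_range`, and the layer pattern and window size used here are hypothetical.

```python
from enum import Enum


class AttnKind(Enum):
    MHA = "mha"                  # full causal attention
    MHA_SLIDING = "mha_sliding"  # per-layer sliding window attention


def layer_kinds(num_layers, global_every):
    """Hypothetical layout: every `global_every`-th layer is global, the rest slide."""
    return [
        AttnKind.MHA if (i + 1) % global_every == 0 else AttnKind.MHA_SLIDING
        for i in range(num_layers)
    ]


def visible_range(kind, query_pos, window_size):
    """Half-open range [start, end) of key positions visible to the query.

    Nothing is evicted: a sliding layer only moves its start offset forward,
    so global layers sharing the same cache can still read older entries.
    """
    end = query_pos + 1  # causal: the query sees itself and earlier positions
    if kind is AttnKind.MHA_SLIDING:
        start = max(0, end - window_size)
    else:
        start = 0
    return start, end


if __name__ == "__main__":
    # Illustrative values only; they are not taken from any model config.
    for layer, kind in enumerate(layer_kinds(num_layers=6, global_every=6)):
        print(layer, kind.name, visible_range(kind, query_pos=9, window_size=4))
```

Computing the offset per layer, instead of evicting pages, is what lets global and sliding layers coexist in one cache, which matches the reason given above for disabling the regular sliding window.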

joshua-j-hong, May 07 '25 19:05