
[RFC] Support multiple attention types

Open · ivanium opened this issue 1 month ago · 0 comments

Motivation

Emerging models are increasingly adopting a hybrid of multiple attention types to capture different aspects of the input data. For example, GPT-oss uses a combination of full attention and sliding window attention. In this RFC, we propose to support multiple attention types in kvcached.

Background

Currently, kvcached assumes that all layers are uniform and share the same attention type. This assumption drives several interface designs and optimizations, and it raises three issues.

  1. Interface: we only allow the user to pass in a single KV tensor shape and a single number of layers (see the sketch after this list).
  2. Data layout: when organizing the KV tensors, we adopt a contiguous low-level layout that places KV cache blocks from different layers contiguously in memory, while exposing reshaped tensors to the high-level Python code.
  3. Optimization: when mapping/unmapping physical pages, we operate once for all layers, assuming they are identical.
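
To make the gap concrete, here is a minimal Python sketch contrasting today's single-shape assumption with what a hybrid model requires. The names (`init_kv_cache`, `layer_types`) are illustrative only, not the actual kvcached API.

```python
num_layers = 36
kv_shape = (2, 1024, 16, 8, 64)  # (K/V, num_blocks, block_size, num_kv_heads, head_dim)

# Today: one shape and one layer count describe the whole cache.
# init_kv_cache(kv_shape, num_layers)  # hypothetical single-type interface

# Hybrid models (e.g. gpt-oss) interleave attention types, so a single
# (shape, num_layers) pair can no longer describe every layer:
layer_types = ["full" if i % 2 == 0 else "sliding_window" for i in range(num_layers)]
```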

Proposed Changes and Milestones

Fully implementing this feature requires a major refactor of the codebase, including the Python interface, the C++ backend, and (potentially) the Python KV cache manager. We propose to break the changes down into several milestones.

For the first milestone, we will revise and add the necessary interfaces and make the data layout changes needed to support multiple attention types. This will be a relatively small change, but it will lay the groundwork for the later refactor.
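
As a rough illustration of this interface change, the sketch below lets the caller describe one spec per attention-type group instead of a single shape plus layer count. `KVGroupSpec` and `init_kv_cache` are hypothetical names, not the actual kvcached interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class KVGroupSpec:
    attn_type: str               # e.g. "full" or "sliding_window"
    layer_ids: List[int]         # layers belonging to this group
    kv_shape: Tuple[int, ...]    # per-layer KV tensor shape for this group

def init_kv_cache(groups: List[KVGroupSpec]) -> None:
    """Placeholder: would forward each group's spec to the backend."""
    for g in groups:
        print(f"group {g.attn_type}: {len(g.layer_ids)} layers, shape {g.kv_shape}")

# Example: gpt-oss-style interleaving of full and sliding-window attention.
full_layers = [i for i in range(36) if i % 2 == 0]
swa_layers = [i for i in range(36) if i % 2 == 1]
init_kv_cache([
    KVGroupSpec("full", full_layers, (2, 1024, 16, 8, 64)),
    KVGroupSpec("sliding_window", swa_layers, (2, 256, 16, 8, 64)),
])
```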

For the second milestone, we will implement the simplest C++ backend support for multiple attention types. Specifically, we can follow what the serving engines do today: group layers by their attention type and create a separate KV cache pool for each group. Similarly, we can create separate FTensor pools in our C++ backend.
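
Below is a Python-level mock of the grouping logic the C++ backend could mirror; `FTensorPool` here is a stand-in class, not the real kvcached C++ type.

```python
from collections import defaultdict

class FTensorPool:
    """Placeholder for a per-group pool of FTensors (one per layer in the group)."""
    def __init__(self, attn_type, layer_ids, kv_shape):
        self.attn_type = attn_type
        self.layer_ids = layer_ids
        self.kv_shape = kv_shape

def build_pools(layer_types, shapes_by_type):
    # Group layers by their attention type, then create one pool per group.
    groups = defaultdict(list)
    for layer_id, attn_type in enumerate(layer_types):
        groups[attn_type].append(layer_id)
    return {t: FTensorPool(t, ids, shapes_by_type[t]) for t, ids in groups.items()}

layer_types = ["full" if i % 2 == 0 else "sliding_window" for i in range(36)]
pools = build_pools(layer_types, {
    "full": (2, 1024, 16, 8, 64),
    "sliding_window": (2, 256, 16, 8, 64),
})
```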

For the third milestone, we will extend the Python KV cache manager to support multiple attention types, following the same approach: group layers by their attention type and maintain a separate KV cache pool for each group on the Python side.
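
The sketch below shows one way the Python KV cache manager could route block allocation to per-group pools; the class names and allocation policy are purely illustrative.

```python
class GroupKVCachePool:
    """Tracks free KV cache blocks for one attention-type group."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def alloc(self, n: int):
        assert len(self.free_blocks) >= n, "out of KV cache blocks for this group"
        blocks, self.free_blocks = self.free_blocks[:n], self.free_blocks[n:]
        return blocks

    def free(self, blocks):
        self.free_blocks.extend(blocks)

class HybridKVCacheManager:
    """Routes block allocation to the pool matching the layer's attention type."""
    def __init__(self, blocks_per_type: dict):
        self.pools = {t: GroupKVCachePool(n) for t, n in blocks_per_type.items()}

    def alloc(self, attn_type: str, n: int):
        return self.pools[attn_type].alloc(n)

manager = HybridKVCacheManager({"full": 1024, "sliding_window": 256})
blocks = manager.alloc("sliding_window", 4)
```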

At this point, we should have a working implementation that supports multiple attention types, though it may not yet be the most efficient.

For the fourth milestone, we will refactor the C++ codebase and explore how the FTensor abstraction can unify the storage layout of different attention types, so that all attention types can share a single KV cache pool for the highest efficiency.
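
As a very rough sketch of the direction this milestone explores, a single shared page pool could serve every attention-type group regardless of its block geometry; all names and details below are speculative.

```python
class SharedPagePool:
    """One physical page pool shared across all attention-type groups."""
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))

    def map_pages(self, n: int):
        pages, self.free_pages = self.free_pages[:n], self.free_pages[n:]
        return pages

    def unmap_pages(self, pages):
        self.free_pages.extend(pages)

pool = SharedPagePool(num_pages=4096)
# Each group draws from the same pool regardless of its KV block geometry,
# which is the efficiency goal described above.
full_pages = pool.map_pages(8)
swa_pages = pool.map_pages(2)
```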

ivanium · Oct 29 '25 18:10