[OV GPU] Add the capability for KV cache to update past KV

Open Kotomi-Du opened this issue 4 weeks ago • 2 comments

Details:

This PR recognizes the ScatterElementsUpdate + Slice node pattern (the blue nodes in the picture below) and fuses it into a multi-stage KVCache node. After fusion, the two original nodes are handled as follows:

  1. ScatterElementsUpdate is handled by adding a reorder_stage that executes the ScatterElementsUpdate kernel.
  2. Slice is handled as an in-place crop by updating the data padding of the variableState.
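
To make the two steps concrete, here is a minimal sketch in plain Python of what the fused pattern computes. It is not the GPU plugin's code: the 2D `[token, feature]` layout, the helper names, and the `PaddedCache` class are assumptions made for illustration only.

```python
def scatter_elements_update(data, indices, updates):
    """Reference semantics for a 2D tensor with axis=0:
    out[indices[i][j]][j] = updates[i][j]."""
    out = [row[:] for row in data]  # copy; the input stays untouched
    for i, index_row in enumerate(indices):
        for j, target in enumerate(index_row):
            out[target][j] = updates[i][j]
    return out


def slice_rows(data, start, stop):
    """Slice along the token axis, returning a copy."""
    return [row[:] for row in data[start:stop]]


class PaddedCache:
    """Hypothetical stand-in for a variable state whose logical length
    is tracked through padding instead of copying the buffer."""

    def __init__(self, buffer):
        self.buffer = buffer
        self.upper_pad = 0  # rows at the end that are ignored

    def crop_to(self, length):
        # In-place crop: only the padding changes, the buffer is not copied.
        self.upper_pad = len(self.buffer) - length

    def rows(self):
        return self.buffer[: len(self.buffer) - self.upper_pad]


# Past KV with 5 token rows. Suppose rows 2..4 are draft tokens and only
# rows 2 and 4 are accepted: move row 4 into slot 3, then trim to 4 rows.
past_kv = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]
indices = [[3, 3]]   # target row for each feature column
updates = [[4, 4]]   # values taken from row 4

reordered = scatter_elements_update(past_kv, indices, updates)
trimmed = slice_rows(reordered, 0, 4)

cache = PaddedCache(reordered)
cache.crop_to(4)

print(trimmed)       # [[0, 0], [1, 1], [2, 2], [4, 4]]
print(cache.rows())  # same logical content, buffer still holds 5 rows
```

The `slice_rows` path copies the kept rows, while the `PaddedCache` path only adjusts a padding counter, which is the idea behind handling Slice as an in-place crop.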

The picture below shows the graph before and after fusion.

[image]

Motivation and Context

The Microsoft Phi-Silica application uses tree-based speculative decoding to accelerate LLM inference. Because only a single branch of the speculative draft tree is accepted after verification, this technique requires frequent manipulation of past KV cache states (e.g. trimming, reordering).

The KV cache API currently available in OV is too slow to meet MSFT requirements; details are in CVS-174809. As the OV team suggested, the only way to support the reorder feature is to add specific nodes to the original graph. This PR recognizes the pattern of the added nodes and fuses them into a multi-stage KVCache node for better performance.

Tickets:

CVS-176367

Kotomi-Du, Dec 03 '25 19:12

build_jenkins

p-durandin, Dec 09 '25 05:12

build_jenkins

Kotomi-Du, Dec 10 '25 22:12

Please do not mention customer name in the description. I already updated it.

isanghao, Dec 12 '25 11:12