[RFC]: Sync KV cache from inference engine to gateway

Open varungup90 opened this issue 8 months ago • 0 comments

🚀 Feature Description and Motivation

The gateway currently caches each input request together with the pod it was routed to. It uses this mapping to prefix-match future requests, so that they can be routed to the pod that already holds the relevant KV cache locally.
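To make the current behavior concrete, here is a minimal sketch of gateway-side prefix matching. It is not the gateway's actual code: `PrefixRouter`, `block_hashes`, and `BLOCK_SIZE` are hypothetical names, and the chained block-hash scheme is an assumption chosen for illustration.

```python
import hashlib
from typing import Dict, List, Optional, Set

BLOCK_SIZE = 4  # hypothetical: number of tokens per prefix block


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block; each hash identifies a whole prefix."""
    hashes: List[str] = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        data = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(data.encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixRouter:
    """Hypothetical gateway-side map from prefix hash to the pods that served it."""

    def __init__(self) -> None:
        self.prefix_to_pods: Dict[str, Set[str]] = {}

    def record(self, tokens: List[int], pod: str) -> None:
        # Remember that `pod` has processed every full-block prefix of this request.
        for h in block_hashes(tokens):
            self.prefix_to_pods.setdefault(h, set()).add(pod)

    def match(self, tokens: List[int]) -> Optional[str]:
        # Walk the prefix block by block and keep the pods seen at the deepest match.
        best: Optional[Set[str]] = None
        for h in block_hashes(tokens):
            pods = self.prefix_to_pods.get(h)
            if not pods:
                break
            best = pods
        # Ties are broken by name only to keep the sketch deterministic.
        return min(best) if best else None


router = PrefixRouter()
router.record([1, 2, 3, 4, 5, 6, 7, 8], "pod-a")
router.record([1, 2, 3, 4, 9, 9, 9, 9], "pod-b")
print(router.match([1, 2, 3, 4, 5, 6, 7, 8, 10]))
```

Note that this map only reflects what the gateway routed, not what each pod still holds, which is the gap this RFC is about.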

However, a gateway or pod restart can invalidate this mapping, because the gateway's record of what each pod holds may no longer match the pod's actual KV cache. The goal is to sync KV cache information from the pods to the gateway in near real time so that the two stay consistent.

For reference: Dynamo uses pub-sub to sync this state.
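As a rough illustration of what a pub-sub sync could look like, the sketch below has pods publish KV cache events that the gateway applies to its own index. Everything here is an assumption for discussion: the `KVEvent` schema, the event kinds (`stored`, `removed`, `cleared`), and the in-process `Bus` are hypothetical and do not describe Dynamo's or any inference engine's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class KVEvent:
    """Hypothetical event a pod publishes when its KV cache changes."""
    pod: str
    kind: str  # "stored", "removed", or "cleared" (e.g. after a pod restart)
    block_hashes: Tuple[str, ...] = ()


class Bus:
    """In-process stand-in for a real pub-sub transport."""

    def __init__(self) -> None:
        self.subscribers: List[Callable[[KVEvent], None]] = []

    def subscribe(self, handler: Callable[[KVEvent], None]) -> None:
        self.subscribers.append(handler)

    def publish(self, event: KVEvent) -> None:
        for handler in self.subscribers:
            handler(event)


class GatewayIndex:
    """Gateway-side view of which KV blocks each pod currently holds."""

    def __init__(self) -> None:
        self.blocks_by_pod: Dict[str, Set[str]] = {}

    def apply(self, event: KVEvent) -> None:
        blocks = self.blocks_by_pod.setdefault(event.pod, set())
        if event.kind == "stored":
            blocks.update(event.block_hashes)
        elif event.kind == "removed":
            blocks.difference_update(event.block_hashes)
        elif event.kind == "cleared":
            # Drop everything known about this pod.
            blocks.clear()

    def pods_holding(self, block_hash: str) -> Set[str]:
        return {pod for pod, blocks in self.blocks_by_pod.items() if block_hash in blocks}


bus = Bus()
index = GatewayIndex()
bus.subscribe(index.apply)
bus.publish(KVEvent("pod-a", "stored", ("h1", "h2")))
bus.publish(KVEvent("pod-a", "removed", ("h2",)))
print(index.pods_holding("h1"), index.pods_holding("h2"))
```

This covers a pod restart through the `cleared` event. A restarted gateway would still need a snapshot or replay mechanism to rebuild its index, which the sketch leaves open.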

Use Case

Proposed Solution

No response

Apr 04 '25 18:04 varungup90