aibrix
[RFC]: Sync KV cache from inference engine to gateway
🚀 Feature Description and Motivation
The gateway currently caches the input request along with the pod it was routed to. This cached information is then used to perform prefix matching for future requests, allowing them to be routed to the same pod that already holds the relevant KV cache locally.
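To make the current behavior concrete, here is a minimal sketch of prefix-aware routing. It is illustrative only and not the AIBrix gateway implementation: `PrefixRouter`, `block_hashes`, and `BLOCK_SIZE` are hypothetical names, and the chained block hashing is an assumption about how prefixes could be keyed.

```python
from typing import Dict, List, Set

BLOCK_SIZE = 4  # hypothetical number of tokens per hashed block


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[int]:
    """Hash each full block of tokens, chaining in the previous hash so that
    a block's hash identifies the whole prefix ending at that block."""
    hashes: List[int] = []
    prev = 0
    for i in range(0, len(tokens) - block_size + 1, block_size):
        prev = hash((prev, tuple(tokens[i:i + block_size])))
        hashes.append(prev)
    return hashes


class PrefixRouter:
    """Hypothetical gateway-side cache: prefix hash -> pods that served it."""

    def __init__(self) -> None:
        self._index: Dict[int, Set[str]] = {}

    def record(self, tokens: List[int], pod: str) -> None:
        # Remember that this pod has seen every full-block prefix of the request.
        for h in block_hashes(tokens):
            self._index.setdefault(h, set()).add(pod)

    def route(self, tokens: List[int], pods: List[str]) -> str:
        # Track, per candidate pod, how many leading blocks it has seen.
        matched = {pod: 0 for pod in pods}
        for depth, h in enumerate(block_hashes(tokens), start=1):
            holders = self._index.get(h, set()) & set(pods)
            if not holders:
                break  # no pod holds a longer prefix
            for pod in holders:
                matched[pod] = depth
        # Longest match wins; ties fall back to the order of `pods`.
        return max(pods, key=lambda pod: matched[pod])


router = PrefixRouter()
router.record([1, 2, 3, 4, 5, 6, 7, 8], "pod-a")
router.record([9, 9, 9, 9], "pod-b")
chosen = router.route([1, 2, 3, 4, 5, 6, 7, 8, 10, 11], ["pod-b", "pod-a"])
```

Note that this index only reflects what the gateway itself routed; it has no way to learn that a pod evicted or lost the corresponding KV cache.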
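However, a gateway or pod restart can break this mapping: a restarted gateway can lose its cached request-to-pod entries, and a restarted pod no longer holds the KV cache that the gateway still attributes to it. The goal is to sync KV cache information from the pods to the gateway in near real time, so that the gateway's view stays consistent with what each pod actually holds.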
For reference, Dynamo uses pub-sub to sync this state.
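As a starting point for discussion, a pub-sub sync could look roughly like the sketch below. Everything here is an assumption: the event kinds (`stored`, `removed`, `cleared`), the `KVEvent`, `EventBus`, and `GatewayIndex` names, and the in-process bus standing in for a real message broker. It does not reflect Dynamo's or any inference engine's actual event format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List, Set, Tuple


@dataclass(frozen=True)
class KVEvent:
    """Hypothetical event a pod publishes when its KV cache changes."""
    pod: str
    kind: str  # "stored", "removed", or "cleared" (e.g. after a pod restart)
    block_hashes: Tuple[int, ...] = ()


class EventBus:
    """In-process stand-in for a real pub-sub broker."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[KVEvent], None]] = []

    def subscribe(self, handler: Callable[[KVEvent], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: KVEvent) -> None:
        for handler in self._subscribers:
            handler(event)


class GatewayIndex:
    """Gateway-side view of which pods hold which KV cache blocks."""

    def __init__(self) -> None:
        self._pods_by_block: DefaultDict[int, Set[str]] = defaultdict(set)

    def apply(self, event: KVEvent) -> None:
        if event.kind == "stored":
            for h in event.block_hashes:
                self._pods_by_block[h].add(event.pod)
        elif event.kind == "removed":
            for h in event.block_hashes:
                self._pods_by_block[h].discard(event.pod)
        elif event.kind == "cleared":
            # The pod lost its whole cache, so drop it from every block.
            for pods in self._pods_by_block.values():
                pods.discard(event.pod)

    def pods_with(self, block_hash: int) -> Set[str]:
        return set(self._pods_by_block.get(block_hash, set()))


bus = EventBus()
index = GatewayIndex()
bus.subscribe(index.apply)

bus.publish(KVEvent("pod-a", "stored", (1, 2)))
bus.publish(KVEvent("pod-b", "stored", (1,)))
after_store = index.pods_with(1)

bus.publish(KVEvent("pod-b", "removed", (1,)))
bus.publish(KVEvent("pod-a", "cleared"))
after_clear = index.pods_with(1)
```

With this shape, the pods are the source of truth: a restarted pod publishes `cleared`, and a restarted gateway would need either a replay of recent events or a full snapshot from each pod to rebuild its index, which is left open here.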
Use Case
NA
Proposed Solution
No response