
Support session tracking for LLM request

Open Jeffwan opened this issue 10 months ago • 8 comments

🚀 Feature Description and Motivation

RAG and Agent patterns are typically multi-threaded programs. This application-level information should be exposed to the underlying system, which can leverage it for better colocation and similar optimizations.

Use Case

No response

Proposed Solution

No response

Jeffwan avatar Feb 05 '25 23:02 Jeffwan

Related paper: https://arxiv.org/html/2502.13965v1

Jeffwan avatar Apr 29 '25 00:04 Jeffwan

This story will take more effort, so we will move it to a future release.

Jeffwan avatar Jun 27 '25 01:06 Jeffwan

I have prepared a rough proposal to resolve issue #633. You can find the proposal here: [Google Drive File Link]. Chapter 3 of the proposal specifically discusses two possible implementation approaches I have in mind. I would appreciate any feedback or suggestions from the community. Thanks!

SleepyLGod avatar Jun 27 '25 02:06 SleepyLGod

[RFC] Session-Aware Scheduler and Context Cache Manager Plugins to AIBrix Gateway for Advanced LLM Serving

Corresponding issues: #633, #1248.

Reference of the Context Cache Part: PR #1300 by @zhengkezhou1 .

The old version of the RFC: GIST LINK.

Summary

This RFC proposes integrating a high-performance, session-aware scheduler and a session-aware context cache manager directly into the AIBrix Gateway. The new components will manage the lifecycle of LLM requests based on their session context, implementing an advanced scheduling algorithm inspired by the Autellix paper (ATLAS/PLAS) as well as a session-context reuse mechanism. The goal is to mitigate Head-of-Line Blocking at the session level and to optimize resource utilization for multi-request, conversational workloads, thereby improving throughput and tail latency under high load.

Motivation

As AIBrix is increasingly used for complex, multi-turn GenAI applications (e.g., Agents, RAG pipelines, conversational AI), the default request-level routing, while efficient, faces two key challenges:

  • Head-of-Line Blocking: Under high load, long-running requests (e.g., long-context summarization) can occupy inference slots, causing short, interactive requests (e.g., chatbot turns) to experience high latency. This degrades the user experience for latency-sensitive applications.
  • Poor Utilization of Session Context: The current gateway treats each request in isolation. It is unaware that a series of requests belong to the same user session, leading to repeated computation for frequently used prompts and higher time-to-first-token latency.

By introducing the session-aware scheduler and context cache manager, AIBrix can transition from a simple request dispatcher to an intelligent workload orchestrator. This change will enable AIBrix to:

  • Reduce the latency for short, interactive sessions by prioritizing them over long-running ones.
  • Reduce end-to-end latency at the session level by making good use of the session context cache.
  • Increase overall system throughput.
  • Provide fairness across different user sessions.

Proposed Change

The proposed change involves creating a new scheduler plugin within the AIBrix Gateway, tightly integrated with the Gateway's existing architecture.

You can check these slides for the basic architecture: [Slides link (updating)].

Core Workflow:


```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant CCM as Context Cache Manager
    participant Sched as Scheduler
    participant R as Router
    participant V as vLLM Engine
    participant P as Persistent KV Cache Store

    Note over C,P: Creating a new Context Cache Session with Scheduling
    C->>+E: 1. POST /v1/context (prompt, ttl...)
    E->>+G: 2. Forward Request
    G->>+CCM: 3. Initiate new session
    Note over CCM,Sched: CCM generates ID and submits job to Scheduler
    CCM->>CCM: 4. Generate session-id
    CCM->>+Sched: 5. SubmitJob(session-id)
    Note over Sched: New session (CST=0), gets highest priority
    Sched-->>-CCM: 6. Decision: GO (Permission granted)

    CCM->>+R: 7. Request routing decision
    R->>R: 8. Select least-loaded pod (e.g., Pod-A)
    R-->>-CCM: 9. Return Pod-A metadata

    Note over CCM,V: CCM coordinates inference
    CCM->>+V: 10. Inference Request (to Pod-A)
    V->>V: 11. Execute full inference (Prefill + Decode)
    V->>+P: 12. Offload generated KV Cache to Storage
    P-->>-V: 13. Acknowledge cache saved
    V-->>-CCM: 14. Return output & cache metadata (location, size)
    Note over CCM: CCM finalizes session state
    CCM->>CCM: 15. Update SessionState (location=Pod-A)
    CCM->>+Sched: 16. FinalizeJob(session-id, exec_time...)
    Sched-->>-Sched: 17. Update scheduling stats (CST, etc.)
    CCM-->>-G: 18. Return final response
    G-->>-E: 19. Pipe back response (with x-session-id)
    E-->>-C: 20. Complete Response
```

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant CCM as Context Cache Manager
    participant Sched as Scheduler
    participant R as Router
    participant V as vLLM Engine
    participant P as Persistent KV Cache Store

    Note over C,P: Using an existing session with Scheduling
    C->>+E: 1. POST /v1/completions (x-session-id, prompt...)
    E->>+G: 2. Forward Request
    G->>+CCM: 3. Handle session request
    Note over CCM,Sched: Request must wait for Scheduler's permission
    CCM->>+Sched: 4. SubmitJob(session-id)
    Sched->>Sched: 5. Enqueue job in PriorityQueue<br/>(based on session's CST)
    Note right of Sched: Job waits...<br/>(scheduler loop pops it when<br/>it has the highest priority<br/>AND cluster has capacity)
    Sched-->>-CCM: 6. Decision: GO (Permission granted)

    Note over CCM,R: Now with permission, proceed to routing
    CCM->>CCM: 7. Get Cache Location (e.g., Pod-A) from SessionState
    CCM->>+R: 8. Request routing (with affinity hint: Pod-A)
    R-->>-CCM: 9. Return final routing decision (likely Pod-A)

    alt ✅ Cache is HOT in Pod-A's GPU
        CCM->>+V: 10a. Inference Request (to Pod-A)<br/>(with new tokens only)
    else ⚠️ Cache needs to be loaded (e.g., Pod-A restarted)
        CCM->>+P: 10b. Command: Load cache for session-id
        P-->>V: 11b. Stream KV Cache to Pod-A's GPU
        V->>+CCM: 12b. Acknowledge cache is ready
        CCM->>+V: 13b. Inference Request (to Pod-A)
    end

    V->>V: 14. Execute incremental inference
    V->>+P: 15. Offload updated KV Cache to Storage
    P-->>-V: 16. Acknowledge cache updated
    V-->>-CCM: 17. Return output & updated cache metadata
    Note over CCM: Finalize the job
    CCM->>+Sched: 18. FinalizeJob(session-id, exec_time...)
    Sched->>Sched: 19. Update session's CST
    CCM-->>-G: 20. Return final response
    G-->>-E: 21. Pipe back response
    E-->>-C: 22. Complete Response
```

The scheduler operates through a 4-phase request processing pipeline (a gateway-side code sketch of the submit-and-block handoff follows this list):

  • Phase 1: Request Interception & Job Submission

    • Flow: Client Request → Gateway Process() → Extract/Generate SessionID → CCM receives request → CCM submits Job to Scheduler → Gateway goroutine blocks, awaiting scheduling permission.
    • Description: All incoming stateful requests (e.g., /v1/context or /v1/completions with an x-session-id) are intercepted by the Gateway. The CCM takes control, generates a new session ID if one is not provided, and immediately submits a job to the Scheduler. The request's goroutine then pauses, effectively entering a system-wide queue where it will wait for the Scheduler's explicit permission to proceed.
  • Phase 2: Intelligent Scheduling Decision (Fairness-Driven)

    • Flow: Scheduler's processingLoop → Prioritization via ATLAS/PLAS (using session's CST) → Dynamic Batch Size Calculation (based on cluster capacity - inflight) → Pop Highest-Priority Jobs from Heap → Dispatch "GO" Decision.
    • Description: The Scheduler, acting as a fairness advisor, processes its internal priority queue. Its event-driven loop continuously evaluates which waiting jobs should be processed next based on two conditions: their priority (determined by the session's historical CriticalPathServiceTime) and the real-time capacity of the backend cluster. It selects a batch of the highest-priority jobs and dispatches a "GO" decision, unblocking their respective goroutines.
  • Phase 3: Context-Aware Routing & Forwarding (Performance-Driven)

    • Flow: Gateway goroutine unblocks → CCM queries SessionState for Cache Location → CCM provides strong affinity hint to Router → Router selects optimal Pod → CCM ensures KV Cache is hot (loads from persistent store if necessary) → Gateway forwards request to backend.
    • Description: Once a request receives its "GO" signal from the scheduler, the CCM queries its internal state to find the physical location of the session's KV Cache (e.g., Pod-A). This location is passed to the Router as a strong affinity hint. The Router makes the final pod selection, and the CCM coordinates with the target pod's AI Runtime to ensure the KV Cache is loaded into GPU memory before the gateway forwards the final inference request.
  • Phase 4: Completion & State Reconciliation

    • Flow: Gateway receives full Response → Gateway measures execution/wait times → Gateway calls scheduler.FinalizeJob() → Scheduler updates session's CST and scheduling stats → CCM coordinates with vLLM to persist the updated KV Cache.
    • Description: Upon completion of the request, the Gateway performs the final bookkeeping. It reports the measured performance metrics to the Scheduler by calling FinalizeJob(), allowing the scheduler to update the session's priority score (CST) for future requests. Simultaneously, the CCM coordinates with the vLLM engine to ensure the session's updated KV Cache (which now includes the latest turn) is asynchronously offloaded back to the persistent storage. This closes the loop, reconciling both the logical (scheduling) and physical (cache) state of the session.
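
To make the submit-and-block handoff in Phases 1 and 2 concrete, below is a minimal Go sketch of the gateway side. The Job and Decision types, the Gateway fields, and the error handling are illustrative assumptions for this RFC, not the actual AIBrix API (imports context, errors, and time are elided):

    // Illustrative placeholder types; the real AIBrix definitions will differ.
    type Decision struct {
        Go bool // true = permission granted to proceed to routing
    }

    type Job struct {
        SessionID string
        CST       time.Duration // session's CriticalPathServiceTime at submission
        Decision  chan Decision // per-job channel the Process goroutine blocks on
    }

    // awaitScheduling implements Phases 1-2 from the gateway's perspective:
    // submit the job, then block until the scheduler dispatches a decision
    // or the client cancels the request.
    func (g *Gateway) awaitScheduling(ctx context.Context, sessionID string) error {
        job := &Job{SessionID: sessionID, Decision: make(chan Decision, 1)}
        g.scheduler.SubmitJob(job)
        select {
        case d := <-job.Decision:
            if !d.Go {
                return errors.New("request rejected by scheduler")
            }
            return nil // "GO": proceed to context-aware routing (Phase 3)
        case <-ctx.Done():
            return ctx.Err() // client canceled or timed out while queued
        }
    }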

TODOs:

  • [x] Architectural Design: The scheduler will be built on a high-performance, non-blocking architecture:

    • [X] MPSC (Multi-Producer, Single-Consumer) Channel: A buffered Go channel (submitChan) will serve as the single ingress point for all incoming requests. Multiple concurrent Process goroutines (producers) can submit jobs to this channel in a lock-free manner.
    • [X] Actor Model / Single-Writer Principle: A single, dedicated goroutine (processingLoop) will be the sole consumer of the submitChan and the exclusive owner of the core PriorityQueue.
    • [X] Event-Driven Loop: The processingLoop will be driven by multiple events (new job arrival, job completion signals, and a periodic ticker), ensuring both low latency for new requests and high throughput for clearing the queue.
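
      A minimal sketch of this loop, reusing the illustrative Job and Decision types from the pipeline sketch above (imports container/heap, context, and time are elided; the capacity accounting is a simplified assumption):

      // jobHeap is a min-heap keyed on the session's CST:
      // shorter CST means higher priority (see Core Algorithm below).
      type jobHeap []*Job

      func (h jobHeap) Len() int           { return len(h) }
      func (h jobHeap) Less(i, j int) bool { return h[i].CST < h[j].CST }
      func (h jobHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
      func (h *jobHeap) Push(x any)        { *h = append(*h, x.(*Job)) }
      func (h *jobHeap) Pop() any {
          old := *h
          j := old[len(old)-1]
          *h = old[:len(old)-1]
          return j
      }

      type Scheduler struct {
          submitChan chan *Job   // MPSC ingress: many producers, one consumer
          doneChan   chan string // completion signals free up capacity
          queue      jobHeap     // owned exclusively by processingLoop
          capacity   int         // stand-in for the dynamic batch size
          inflight   int
      }

      func (s *Scheduler) SubmitJob(j *Job) { s.submitChan <- j }

      // processingLoop is the single consumer of submitChan and the sole
      // owner of the heap, so the queue needs no locking (single-writer).
      func (s *Scheduler) processingLoop(ctx context.Context) {
          ticker := time.NewTicker(10 * time.Millisecond) // interval is an assumption
          defer ticker.Stop()
          for {
              select {
              case j := <-s.submitChan:
                  heap.Push(&s.queue, j)
              case <-s.doneChan:
                  s.inflight-- // a request finished; capacity freed
              case <-ticker.C:
                  // periodic sweep: re-evaluate the queue even without new events
              case <-ctx.Done():
                  return
              }
              // release the highest-priority jobs while capacity remains
              for s.queue.Len() > 0 && s.inflight < s.capacity {
                  j := heap.Pop(&s.queue).(*Job)
                  s.inflight++
                  j.Decision <- Decision{Go: true} // unblocks the Process goroutine
              }
          }
      }
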
  • [X] Core Algorithm: Unified ATLAS/PLAS with Anti-Starvation: The scheduler implements a unified version of the session-aware scheduling algorithms in the Autellix paper.

    • [X] Priority Metric: The core priority for any request will be its session's CriticalPathServiceTime(CST)—a measure of the cumulative execution time of the longest chain of requests in that session. Shorter CST means higher priority.
    • [X] Unified Logic: This single metric naturally handles both single-thread request sessions (PLAS) and multi-thread request sessions (ATLAS).
    • [X] Anti-Starvation (long sessions): The priority queue's comparison logic will include an anti-starvation mechanism. If a job's TotalWaitTime / InheritedCST exceeds a configurable threshold, its priority will be significantly boosted to ensure it eventually gets processed.
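
      The comparison logic could look roughly like the following, assuming the Job from the sketches above additionally carries InheritedCST and TotalWaitTime fields (the threshold is a configurable knob):

      // starving reports whether a job has waited too long relative to the
      // CST it inherited from its session at submission time.
      func starving(j *Job, threshold float64) bool {
          return j.InheritedCST > 0 &&
              float64(j.TotalWaitTime)/float64(j.InheritedCST) > threshold
      }

      // less: starving jobs jump the queue; otherwise plain ATLAS/PLAS
      // ordering by CST (shorter critical-path service time wins).
      func less(a, b *Job, threshold float64) bool {
          sa, sb := starving(a, threshold), starving(b, threshold)
          if sa != sb {
              return sa
          }
          return a.CST < b.CST
      }
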
  • [ ] State Management: In-Memory Session Cache -> Context Cache Manager (CCM)

    • [ ] Session & Context Cache

      • [x] MutexSessionCache: A thread-safe, in-memory cache (a mutex-guarded map[string]*SessionState) will be used to store the state of all active sessions (CST, TotalWaitTime, etc.); a sketch follows the session table below.
      • [ ] A thread-safe, in-memory cache will be the core of the Context Cache Manager (CCM). It will store the unified SessionState, containing both scheduling metadata (CST) and physical cache metadata (location, TTL).
    • [ ] Lifecycle Management:

      • [x] The cache will feature a background goroutine to periodically clean up stale sessions that have been inactive for a configurable duration, preventing memory leaks.
      • [ ] The CCM will manage the full lifecycle of contexts. It will expose methods for explicit creation and deletion (via the new /v1/context endpoint) and handle automatic cleanup based on the user-provided TTL.
    • [ ] Integration:

      • [x] An instance of the SessionCache will be created in the Gateway's NewServer() function and shared with both the Scheduler (for reading priority info) and the Gateway's Process state machine (for triggering state updates).
      • [ ] An instance of the CCM will be created in NewServer(). It will be the central coordinator. The Scheduler and the Gateway's Router will both hold a reference to the CCM to query session state.

      Current Session Table:

      type SessionState struct {
          SessionID               string        // index
          CriticalPathServiceTime time.Duration // ATLAS/PLAS cumulative session service time
          TotalWaitTime           time.Duration // anti-starvation for long session
          PodAffinity             string        // optional now (used for routing)
          LastActivityTimestamp   time.Time     // session auto-clean-up
      }
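
      A minimal sketch of the MutexSessionCache and the background cleanup goroutine described in the checklist above, built on this session table (sync and time imports elided; method names are illustrative, following the interface draft):

      // MutexSessionCache guards the session table with an RWMutex.
      type MutexSessionCache struct {
          mu       sync.RWMutex
          sessions map[string]*SessionState
      }

      func (c *MutexSessionCache) GetOrCreate(id string) *SessionState {
          c.mu.Lock()
          defer c.mu.Unlock()
          if s, ok := c.sessions[id]; ok {
              s.LastActivityTimestamp = time.Now()
              return s
          }
          s := &SessionState{SessionID: id, LastActivityTimestamp: time.Now()}
          c.sessions[id] = s
          return s
      }

      // Cleanup starts a background goroutine that evicts sessions idle
      // longer than maxIdle; the returned function stops the goroutine.
      func (c *MutexSessionCache) Cleanup(interval, maxIdle time.Duration) (stop func()) {
          done := make(chan struct{})
          go func() {
              t := time.NewTicker(interval)
              defer t.Stop()
              for {
                  select {
                  case <-t.C:
                      c.mu.Lock()
                      for id, s := range c.sessions {
                          if time.Since(s.LastActivityTimestamp) > maxIdle {
                              delete(c.sessions, id)
                          }
                      }
                      c.mu.Unlock()
                  case <-done:
                      return
                  }
              }
          }()
          return func() { close(done) }
      }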
      

      Advanced Session Cache Manager Design:

      // Core idea: separate data with different lifecycles into different structures.
      // An alternative: instead of one large 'map[string]*SessionState' inside the CCM,
      // maintain two smaller maps, 'map[string]*SchedulingInfo' and 'map[string]*ContextInfo'.
      type SessionState struct {
          SessionID string `json:"session_id"`
      
          // Scheduling-related data (system TTL)
          SchedulingInfo *SchedulingInfo `json:"scheduling_info,omitempty"`
      
          // Context Cache-related data (user TTL / system TTL)
          ContextInfo *ContextInfo `json:"context_info,omitempty"`
      }
      
      // Previous session table
      type SchedulingInfo struct {
          CriticalPathServiceTime time.Duration `json:"critical_path_service_time"` 
          TotalWaitTime           time.Duration `json:"total_wait_time"`
          LastSchedulingActivity  time.Time     `json:"last_scheduling_activity"`
          ... (e.g., RequestCount, TotalTokensProcessed)
      }
      
      // Context Cache
      type ContextInfo struct {
          // lifecycle management
          ExplicitlyCreated bool          `json:"explicitly_created"` // by "/v1/context" or not
          UserSpecifiedTTL  time.Duration `json:"user_specified_ttl"`
          CreationTime      time.Time     `json:"creation_time"`
          LastContextAccess time.Time     `json:"last_context_access"`
          ...
      
          // Context metadata
          ContextSize    int64  `json:"context_size"`
          ContextVersion int64  `json:"context_version"`
          ...
      
          // cache location (integration with prefix cache)
          CacheLocation     string        `json:"cache_location"`
          CacheStatus       CacheStatus   `json:"cache_status"`
          CacheValidUntil   time.Time     `json:"cache_valid_until"`
          ...
      
          // session-specific prefix cache info:
          // we will not create a local prefix cache for each session;
          // we just reuse the global prefix cache hashtable
          SessionPrefixInfo *SessionPrefixInfo `json:"session_prefix_info,omitempty"`
      }
      
      type SessionPrefixInfo struct {
          // current accumulated prefix cache hashes
          AccumulatedPrefixHashes []uint64 `json:"accumulated_prefix_hashes"`
      
          // last match info
          LastMatchedPod   string    `json:"last_matched_pod"`
          LastMatchPercent int       `json:"last_match_percent"`
          LastPrefixMatch  time.Time `json:"last_prefix_match"`
      
          // incrementally compute the prefix cache hash
          TokenSequence      []byte `json:"token_sequence,omitempty"` // optional
          CurrentTokenLength int    `json:"current_token_length"`
          ...
      }
      
      type CacheStatus int
      const (
          CacheStatusUnknown CacheStatus = iota
          CacheStatusHot     // GPU Mem
          CacheStatusWarm    // Local Storage, fast-loading
          CacheStatusCold    // Remote Storage, slow-loading
          CacheStatusLoading // Loading
          CacheStatusInvalid // Invalid
          ...
      )
      

      Corresponding Interface Draft Design:

      type SessionCache interface {
          // current methods
          GetOrCreateForScheduler(sessionID string) (time.Duration, time.Duration)
          UpdateState(sessionID string, inheritedCST, executionTime, waitTime time.Duration)
      
          // context cache management
          CreateExplicitContext(sessionID string, ttl time.Duration) error
          GetContextInfo(sessionID string) (*ContextInfo, bool)
          UpdateContextAccess(sessionID string) error
      
          // for routing
          GetSessionRoutingHint(sessionID string) ...
          UpdateCacheLocation(sessionID, podName string, status CacheStatus) error
          UpdateSessionPrefix(sessionID string, newTokens []byte, selectedPod string, ...) error
          ...
          // OR: more atomic 'get' methods
      
          // lifecycle management
          Cleanup(schedulingTimeout, ... time.Duration) (stop func())
          ...
      }
      
  • [X] Load-Aware Batching & Graceful Degradation: To prevent backend engine starvation while avoiding overload, the scheduler will implement intelligent, dynamic batching.

    • [X] Dynamic Batch Size: On each scheduling cycle, the scheduler calculates a batchSize representing the number of requests it can release.
    • [X] Strategy Pattern with Graceful Degradation: By default, it integrates with AIBrix's existing cache and loadProvider to get real-time Pod utilization metrics. The batchSize (i.e., how many requests are planned to be popped out to the router) is precisely calculated based on the actual available capacity of the cluster.
    • [X] Improved Fallback Strategy: If the advanced load provider is unavailable, the scheduler gracefully degrades. It calculates capacity based on Pod annotations (aibrix.io/max-concurrent-requests) and the current number of in-flight requests. This provides a robust, intelligent baseline even without real-time metrics.
    • [X] Pass-through Mode: If no capacity information is available at all (e.g., no annotations), the scheduler enters a "pass-through" mode by setting a very large batchSize, effectively falling back to the behavior of having no scheduler component. This disables throttling and prioritizes low latency, handing off backpressure responsibility to the backend engines.
    • [X] Decision Smoothing: An Exponential Moving Average (EMA) is applied to the calculated batchSize to smooth out fluctuations caused by noisy metrics, leading to more stable and predictable system throughput.
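
      Roughly, the capacity calculation with EMA smoothing could look like this (assuming an emaBatch float64 field on the Scheduler sketch above; the alpha value is illustrative, not a real default):

      // nextBatchSize computes how many queued requests may be released this
      // cycle, smoothing the raw capacity signal with an EMA.
      func (s *Scheduler) nextBatchSize(clusterCapacity, inflight int) int {
          raw := clusterCapacity - inflight
          if raw < 0 {
              raw = 0 // never release into an already saturated cluster
          }
          const alpha = 0.3
          s.emaBatch = alpha*float64(raw) + (1-alpha)*s.emaBatch
          return int(s.emaBatch)
      }
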
  • [ ] Integration with Gateway State Machine: The new scheduler and CCM will be seamlessly integrated into the Gateway's gRPC stream processing loop using a state machine pattern:

    • [ ] Submission:
      • [x] After receiving the RequestHeader, the Process goroutine will asynchronously submit a job to the scheduler and transition to a stateAwaitingDecision state. The goroutine then waits on a channel for the scheduler's decision.
      • [ ] After receiving the RequestHeader (or on a POST /v1/context call), the Process goroutine passes the request context to the CCM. The CCM is responsible for creating the session state and then submitting the job to the Scheduler. The Process goroutine waits for the scheduler's decision.
    • [x] Dispatch: The scheduler's processingLoop eventually selects the job and sends a Decision back through the job's channel.
    • [ ] Routing:
      • [x] The awakened Process goroutine receives the "go-ahead", performs the final routing via selectTargetPod, and transitions to stateForwarding.
      • [ ] The awakened Process goroutine receives the "go-ahead". It then asks the CCM for the session's affinity hint (GetCacheLocation). This hint, along with other metrics, is used by selectTargetPod for the final routing decision. The Gateway then informs the CCM of the chosen Pod, allowing the CCM to coordinate any necessary cache loading.
        // expand current router to support pod affinity hints
        type SessionAwareRouter interface {
            types.Router
            ...
            // new
            RouteWithSessionHint(ctx *types.RoutingContext,
                readyPodList types.PodList,
                sessionHint *SessionRoutingHint) (string, error)
        }
        
        // Session routing hints (optional para)
        type SessionRoutingHint struct {
            // from session cache
            PreferredPod      string   `json:"preferred_pod"`
            SessionPrefixInfo *SessionPrefixInfo `json:"session_prefix_info,omitempty"`
        
            // integration with the global hashtable
            GlobalPrefixHashes []uint64 `json:"global_prefix_hashes,omitempty"`
        }
        
        type sessionAwarePrefixCacheRouter struct {
            // current router
            prefixCacheRouter
            // new
            sessionCache SessionCache
        }
        
        func (r *sessionAwarePrefixCacheRouter) RouteWithSessionHint(
            ...) (string, error) {
            ...
            return r.routeWithSessionAwareness(..., sessionHint)
        }
        
        func (r *sessionAwarePrefixCacheRouter) routeWithSessionAwareness(
            ...) (string, error) {
            ...
            return ctx.TargetAddress(), nil
        }
        
    • [ ] Finalization:
      • [x] Upon request completion (e.g., ResponseBody with completed=true), the Process goroutine calls scheduler.FinalizeJob(), providing the necessary timing information to update the session's state and decrement the in-flight request counter.
      • [ ] Upon request completion, the Process goroutine calls scheduler.FinalizeJob() as before. Additionally, it informs the CCM that the request is complete, allowing the CCM to trigger the asynchronous offloading of the updated KV Cache to persistent storage.
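
      From the gateway side, finalization could be as simple as the sketch below; FinalizeJob is named in this RFC, while OffloadKVCache is a hypothetical CCM method used for illustration:

      // finalize reports timings back to the scheduler (updating the session's
      // CST and decrementing the in-flight counter) and asks the CCM to persist
      // the updated KV Cache asynchronously.
      func (g *Gateway) finalize(sessionID string, execTime, waitTime time.Duration) {
          g.scheduler.FinalizeJob(sessionID, execTime, waitTime)
          go g.ccm.OffloadKVCache(sessionID) // async offload to persistent storage
      }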

Alternatives Considered

1. Integrating session info with the Gateway's routers (by Le and Jiaxin): Another integration approach is Session-Aware Prefix Entries: moving session metadata into the global prefix cache. Each prefix node could maintain a list of sessions that frequently traverse it. This may enable more intelligent routing decisions, such as co-locating requests from different sessions that share a common long prefix.

2. Pull-based vs. Push-based Request Distribution (by Le): The proposed design retains AIBrix Gateway's push-based model, where the scheduler proactively pushes approved requests to downstream pods. We considered a pull-based alternative, where downstream engine sidecars would actively pull requests from a central queue in the Gateway.

  • Rationale for Pull-based: Under unpredictable LLM serving workloads, a pull model could theoretically improve resource utilization, as an idle engine pod could immediately fetch new work, minimizing idle time. This is particularly compelling for workloads with high variance in task duration.
  • Reason for Deferral: Implementing a pull-based model would require substantial engineering changes to both the Gateway and the AI Runtime Sidecar. It introduces new failure modes (e.g., a sidecar failing to pull) and complexities in managing a distributed queue. The current push-based model, when combined with our load-aware scheduler, already provides a robust mechanism to keep engines fed, making the significant cost of switching to a pull model a lower priority for the initial implementation.

3. Engine-Level Scheduling (The Autellix Approach): The original Autellix paper implemented its scheduling and preemption logic directly within the local inference engine. This allows for fine-grained, iteration-level preemption (e.g., swapping KV Caches mid-generation) to resolve Head-of-Line (HoL) blocking at the most granular level.

  • Rationale: This approach offers the highest possible performance, as the engine has perfect, real-time knowledge of its internal state. It also creates a clean separation of concerns: the Gateway handles intelligent routing, while the engine handles local scheduling.
  • Reason for Deferral: Modifying core inference engines like vLLM is a highly invasive and complex task that would create a maintenance burden and deviate from the upstream projects. Furthermore, engine-level scheduling only achieves a local optimum; it cannot make globally optimal decisions based on the state of the entire cluster or the shared SessionCache. Our in-gateway approach provides a global view, which is a key advantage, even if it cannot perform iteration-level preemption.

4. Time-based vs. Resource-based Fairness (by Le): This is an open research question: should fairness in serving scheduling be measured in resources or in time? This direction requires further discussion.

  • The Dilemma: Suppose there are two pods: Pod1's engine1 is currently processing one request, and Pod2's engine2 is processing four requests. Two new requests arrive simultaneously: RequestA from session1 (higher priority) and RequestB from session2. The scheduler assigns RequestA to Pod1 based on its lower apparent load (fewer active requests); RequestB is then routed to Pod2. However, the unpredictable nature of request durations presents a fundamental challenge: the existing request on Pod1 may be long-running and resource-intensive, delaying RequestA despite its low resource requirements, while Pod2's four requests complete quickly, allowing RequestB to finish sooner.
  • As a result, despite Pod1's actual resource consumption being minimal, Session1's total service time for RequestA is disproportionately longer. This results in Session1's cumulative priority (e.g., based on critical path service time in PLAS/ATLAS) being lower than that of Session2, reversing the intended order. This situation highlights the tension between time-based fairness and resource efficiency, potentially violating SLOs in heterogeneous environments.

5. Decoupled Metadata Registry vs. Integrated Orchestrator for Context Management (by @zhengkezhou1 )

An alternative architectural approach suggests that the Context Cache Manager (CCM) be implemented as a lightweight, independent microservice—a Session Metadata Registry—rather than an integrated orchestrator within the Gateway plugin.

  • Core Idea (Registry Model): In this model, the CCM acts as a passive, centralized "directory service." Its sole responsibility is to manage a lightweight mapping between a session_id and a kv_cache_sidecar_ref (a pointer to the physical KV Cache), along with other metadata like TTL. It does not control the request flow, handle data transfer, or perform scheduling. The Gateway plugin becomes the active coordinator, querying this registry for metadata and then orchestrating interactions with the KV Cache Sidecar and vLLM Engine.

  • Essential Nature: This approach champions a maximalist separation of concerns, where the CCM is purely a state registry, decoupled from the control plane's request lifecycle.

  • Key Distinction from Our Proposed Design:

    • Role: Our proposed CCM is an Orchestrator, actively controlling the request flow through its integrated Scheduler. The alternative CCM is a Registry, passively providing data upon request.
    • Control Flow: Our design introduces a blocking scheduling point within the CCM, enforcing fairness globally. The alternative design has no such mechanism; the Gateway queries and proceeds, making it a performance-optimization feature without fairness guarantees under contention.
    • Location: Our CCM is tightly integrated within the Gateway for minimal latency. The alternative CCM is a standalone service, introducing network latency for every stateful operation.

A New Unified Strategy: Orchestrator with a Pluggable State Backend

The optimal solution is a synthesis of both designs: retain our Integrated Orchestrator model within the Gateway, but abstract its state management backend. The ContextCacheManager (the Orchestrator) in the Gateway does not directly own the state (e.g., in an in-memory map). Instead, it interacts with a SessionStateProvider interface. This interface can have multiple implementations: a default in-memory version (our current plan), and a more robust version that communicates with a dedicated SessionMetadataService (the Registry).
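
A rough sketch of the provider seam (the method set mirrors the SessionCache draft above; MetadataServiceClient is a hypothetical RPC client type, and the context import is elided):

    // SessionStateProvider abstracts where session state lives, so the
    // in-gateway Orchestrator does not care whether it is local or remote.
    type SessionStateProvider interface {
        GetSchedulingInfo(ctx context.Context, sessionID string) (*SchedulingInfo, error)
        GetRoutingHint(ctx context.Context, sessionID string) (*SessionRoutingHint, error)
        UpdateSessionState(ctx context.Context, sessionID string, update *SessionState) error
    }

    // MetadataServiceClient is a placeholder for the registry's RPC surface.
    type MetadataServiceClient interface{}

    // Default backend: the in-memory cache from the current plan.
    type inMemoryProvider struct {
        cache *MutexSessionCache
    }

    // Remote backend: talks to a dedicated SessionMetadataService (the
    // Registry), suitable for multi-replica and autoscaling deployments.
    type metadataServiceProvider struct {
        client MetadataServiceClient
    }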

```mermaid
sequenceDiagram
    participant C as Client
    participant G as "Gateway (with Integrated CCM & Scheduler)"
    participant SSP as "SessionStateProvider (e.g., Metadata Service Client)"
    participant R as Router
    participant Sidecar as "KV Cache Sidecar"
    participant V as vLLM Engine

    Note over C,V: Unified Workflow: Orchestrator in Gateway, State in Backend

    C->>+G: Request (with session-id)

    Note over G: CCM's internal Scheduler is the entry point.
    G->>G: 1. CCM's Scheduler receives the job

    Note over G, SSP: Scheduler queries the state provider for metadata.
    G->>+SSP: 2. GetSchedulingInfo(session-id)
    SSP-->>-G: 3. Return CST, etc.

    Note over G: Job waits in PriorityQueue... then gets permission (GO decision)

    Note over G, R: Now, CCM coordinates routing.
    G->>+SSP: 4. GetRoutingHint(session-id)
    SSP-->>-G: 5. Return Cache Location Ref

    Note over G, R: CCM provides hint to its internal Router.
    G->>G: 6. Router selects optimal pod (e.g., Pod-A)

    Note over G, Sidecar: CCM coordinates cache loading.
    G->>+Sidecar: 7. EnsureCacheReady(Cache Ref)
    Sidecar-->>-G: 8. Cache is HOT

    G->>+V: 9. Inference Request

    Note over G, SSP: Finalization.
    V-->>G: 10. Response
    G->>G: 11. CCM's Scheduler calculates state updates
    G->>+SSP: 12. UpdateSessionState(session-id, updates)
    SSP-->>-G: 13. Acknowledge update
```

SleepyLGod avatar Aug 28 '25 17:08 SleepyLGod

hi @SleepyLGod, is there anything repetitive? i think we can work together :)

zhengkezhou1 avatar Sep 04 '25 03:09 zhengkezhou1

> hi @SleepyLGod, is there anything repetitive? i think we can work together :)

@zhengkezhou1 Sure. To me, these two issues share the same goal: incorporating the concept of sessions into the gateway. The difference is that I'm adding a system-oriented, logically fair scheduling algorithm, while you're applying the concept of sessions to prefix-cache-aware routing (a physical caching framework). I really agree with the robust and empirical design in your proposal.

For now, I don't have a completely detailed idea yet. My tentative design, as proposed in this recently updated RFC, is to have the session cache manager (I previously implemented a very simple session table that recorded some latency metadata just for scheduling) store and manage both scheduler and router data. This would make the scheduler a client of the manager, responsible for ordering decisions, while the router would also be a client of the manager, responsible for placement decisions.

Please allow me to reflect on this for a moment, and we can discuss this in more detail later, as these two issues are indeed closely related.

SleepyLGod avatar Sep 04 '25 04:09 SleepyLGod

/cc

googs1025 avatar Sep 04 '25 05:09 googs1025

I think the final proposed approach is good. The SessionStateProvider can ensure metadata consistency via remote storage in multi-replica and autoscaling scenarios, while an in-memory implementation is more convenient for testing.

  • Router: decides which inference pod to send a request to using algorithms such as least-kv-cache and least-busy-time.
  • Scheduler: blocks all incoming requests and schedules/releases them based on priority.
  • Context Cache Manager: receives requests released by the Scheduler and loads the KV cache into the GPU memory of the pod selected by the Router.

zhengkezhou1 avatar Sep 11 '25 06:09 zhengkezhou1