[RFC]: Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing.

Open zhangjyr opened this issue 10 months ago • 0 comments

Summary

Refactoring for cache:

Merge the multiple pod, model, and metric mappings by adding Pod metadata and Model metadata, held in two main thread-safe registries.
Eliminate the global cache mutex and replace it with multiple layers of locks inside the metadata.

Refactoring for router:

Eliminate thread-unsafe map access in the router interface.
Merge the two contexts, context.Context and routing.RoutingContext, into a single RoutingContext.
Add a queue router to enable per-model request reordering.
Abstract the router interface and RoutingContext so that both the routing and cache packages can share them.

Motivation

Concurrency safety concern for cache and routing interaction:

type Router interface {
	// Route returns the target pod
	Route(ctx context.Context, pods map[string]*v1.Pod, routingCtx RoutingContext) (string, error)
}
type Cache struct {
	mu                sync.RWMutex
	// ...
	ModelToPodMapping map[string]map[string]*v1.Pod   // model_name: map[pod_name]*v1.Pod
	// ...
}

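As shown above, the router interface receives a plain map[string]*v1.Pod, which is not safe for concurrent use, and the same map lives inside ModelToPodMapping in the cache object, which is not safe for concurrent use either. If the cache updates its pods while a router is still reading that map, the Go runtime can abort the process with a fatal "concurrent map read and map write" error.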
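The hazard can be avoided by never handing the shared map to callers. The following is a minimal sketch, not the proposed implementation: `Pod`, `podRegistry`, `Put`, and `Snapshot` are hypothetical names, and `Pod` stands in for `*v1.Pod` so the example has no dependencies. It shows a reader iterating a private copy while a writer updates the registry.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Pod is a hypothetical stand-in for *v1.Pod.
type Pod struct {
	Name string
}

// podRegistry guards its map with a lock and never exposes the map itself.
type podRegistry struct {
	mu   sync.RWMutex
	pods map[string]*Pod
}

func newPodRegistry() *podRegistry {
	return &podRegistry{pods: make(map[string]*Pod)}
}

// Put adds or replaces a pod under the write lock.
func (r *podRegistry) Put(p *Pod) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pods[p.Name] = p
}

// Snapshot copies the current pods into a new slice under the read lock.
// The result is sorted by name and is safe to iterate while Put runs.
func (r *podRegistry) Snapshot() []*Pod {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]*Pod, 0, len(r.pods))
	for _, p := range r.pods {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	reg := newPodRegistry()
	reg.Put(&Pod{Name: "pod-a"})
	snap := reg.Snapshot()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // simulated cache update running concurrently
		defer wg.Done()
		reg.Put(&Pod{Name: "pod-b"})
	}()
	for _, p := range snap { // the router only touches its private snapshot
		fmt.Println(p.Name)
	}
	wg.Wait()
	fmt.Println(len(snap), len(reg.Snapshot()))
}
```

The copy costs an allocation per call, which is the trade-off for letting routers read without holding any cache lock.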

Concurrency performance concern for cache:

type Cache struct {
	mu                sync.RWMutex
	// ...
	metrics           map[string]interface{}
	ModelMetrics      map[string]map[string]interface{}
	Pods              map[string]*v1.Pod
	PodMetrics        map[string]map[string]metrics.MetricValue            // pod_name: map[metric_name]metric_val
	PodModelMetrics   map[string]map[string]map[string]metrics.MetricValue // pod_name: map[model_name]map[metric_name]metric_val
	PodToModelMapping map[string]map[string]struct{}                       // pod_name: map[model_name]struct{}
	ModelToPodMapping map[string]map[string]*v1.Pod                        // model_name: map[pod_name]*v1.Pod
	// ...
	pendingRequests   *sync.Map                                            // model_name: *int32
}

The current cache maintains eight maps to track the pod-model relationship and related metadata such as metrics, all behind one global lock. More metadata will likely be added to the cache to support stateful routing, so a redesign is needed.
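To illustrate why per-entry locking helps, here is a rough sketch under stated assumptions: `PodMeta`, `Registry`, `GetOrCreate`, and the metric name are hypothetical, and metric values are simplified to `float64`. Each pod record carries its own lock, and the registry is a thin wrapper over `sync.Map`, so updates to different pods do not contend on a single mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// PodMeta is a hypothetical per-pod record that merges data the current
// cache spreads over several maps (models served, pod metrics).
type PodMeta struct {
	Name string

	mu      sync.RWMutex // guards only this pod's fields
	models  map[string]struct{}
	metrics map[string]float64
}

// Registry is a thin typed wrapper over sync.Map, keyed by pod name.
type Registry struct {
	pods sync.Map // pod_name: *PodMeta
}

// GetOrCreate returns the record for name, storing a new one if absent.
func (r *Registry) GetOrCreate(name string) *PodMeta {
	fresh := &PodMeta{
		Name:    name,
		models:  make(map[string]struct{}),
		metrics: make(map[string]float64),
	}
	actual, _ := r.pods.LoadOrStore(name, fresh)
	return actual.(*PodMeta)
}

// SetMetric writes one metric under this pod's write lock.
func (p *PodMeta) SetMetric(key string, v float64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.metrics[key] = v
}

// Metric reads one metric under this pod's read lock.
func (p *PodMeta) Metric(key string) (float64, bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	v, ok := p.metrics[key]
	return v, ok
}

// AddModel records that this pod serves the given model.
func (p *PodMeta) AddModel(model string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.models[model] = struct{}{}
}

// Serves reports whether this pod serves the given model.
func (p *PodMeta) Serves(model string) bool {
	p.mu.RLock()
	defer p.mu.RUnlock()
	_, ok := p.models[model]
	return ok
}

func main() {
	var reg Registry
	var wg sync.WaitGroup
	for _, name := range []string{"pod-a", "pod-b"} {
		wg.Add(1)
		go func(n string) { // each goroutine locks only its own pod
			defer wg.Done()
			reg.GetOrCreate(n).SetMetric("running_requests", 1)
		}(name)
	}
	wg.Wait()
	v, ok := reg.GetOrCreate("pod-a").Metric("running_requests")
	fmt.Println(v, ok)
}
```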

Router interface redesign:

Passing two context objects (context.Context and RoutingContext) through the routing interface is redundant.
The current routing policy supports FIFO routing only; we showcase a simple way to add pluggable request queue support that allows request reordering.

Proposed Change

As shown in the UML, we propose:

Use Pod and Model to store all metadata previously maintained in eight maps.
Reduce the eight thread-unsafe global cache maps to two sync.Map wrappers. (Ignoring ModelGPUProfile for now.)
Redefine the router interface to: (a) merge the request context (context.Context) and the routing context (RoutingContext); (b) use an array-like PodArray in place of the pod map.
PodArray supports deployment-based heterogeneous GPUs.
Add new APIs to the routing context to support request reordering.
Add a queue router to showcase a pluggable, stateful, per-model router.
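The per-model reordering idea behind the queue router can be sketched as follows. This is an illustration under assumptions, not the proposed design: `Request`, its `Priority` field, `LessFunc`, `QueueRouter`, `Enqueue`, and `Next` are hypothetical names, and the sketch covers only queueing and reordering, not pod selection.

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a hypothetical queued request. Priority is an illustrative
// reordering key: a lower value is served first.
type Request struct {
	ID       string
	Model    string
	Priority int
}

// LessFunc is the pluggable ordering: true if a should be served before b.
type LessFunc func(a, b Request) bool

// modelQueue holds the pending requests of one model under its own lock.
type modelQueue struct {
	mu      sync.Mutex
	pending []Request
}

// QueueRouter keeps one queue per model, so its state is per model.
type QueueRouter struct {
	less   LessFunc
	queues sync.Map // model_name: *modelQueue
}

func NewQueueRouter(less LessFunc) *QueueRouter {
	return &QueueRouter{less: less}
}

// Enqueue appends the request to its model's queue, creating it if needed.
func (q *QueueRouter) Enqueue(r Request) {
	v, _ := q.queues.LoadOrStore(r.Model, &modelQueue{})
	mq := v.(*modelQueue)
	mq.mu.Lock()
	defer mq.mu.Unlock()
	mq.pending = append(mq.pending, r)
}

// Next removes and returns the request that the ordering ranks first for
// the given model. Requests that tie keep their arrival order.
func (q *QueueRouter) Next(model string) (Request, bool) {
	v, ok := q.queues.Load(model)
	if !ok {
		return Request{}, false
	}
	mq := v.(*modelQueue)
	mq.mu.Lock()
	defer mq.mu.Unlock()
	if len(mq.pending) == 0 {
		return Request{}, false
	}
	best := 0
	for i := 1; i < len(mq.pending); i++ {
		if q.less(mq.pending[i], mq.pending[best]) {
			best = i
		}
	}
	r := mq.pending[best]
	mq.pending = append(mq.pending[:best], mq.pending[best+1:]...)
	return r, true
}

func main() {
	byPriority := func(a, b Request) bool { return a.Priority < b.Priority }
	q := NewQueueRouter(byPriority)
	q.Enqueue(Request{ID: "r1", Model: "llama", Priority: 5})
	q.Enqueue(Request{ID: "r2", Model: "llama", Priority: 1})
	q.Enqueue(Request{ID: "r3", Model: "qwen", Priority: 9})

	first, _ := q.Next("llama")
	second, _ := q.Next("llama")
	other, _ := q.Next("qwen")
	fmt.Println(first.ID, second.ID, other.ID)
}
```

Passing a `LessFunc` that always returns false degenerates to FIFO, which is why the ordering is kept pluggable in this sketch. The linear scan in `Next` keeps the code short; a heap would suit long queues better.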

Alternatives Considered

No response

Mar 14 '25 23:03 zhangjyr