aibrix
Cost-efficient and pluggable Infrastructure components for GenAI inference
### 🚀 Feature Description and Motivation Delay scheduling requests to avoid over-assigning them to some inference engines. We have already discussed push- vs. pull-based solutions. This would...
### 🚀 Feature Description and Motivation RAG and Agent patterns are multi-threaded programs; this application-level information should be exposed to the underlying system to enable better colocation, etc. ###...
### 🚀 Feature Description and Motivation Currently, we are leveraging the Vineyard Operator to orchestrate workloads. While it provides a foundation, we've extended the upstream operator with advanced scheduling features...
### 🚀 Feature Description and Motivation

```yaml
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: text2sql-lora-1
  namespace: default
spec:
  baseModel: llama2-70b
  podSelector:
    matchLabels:
      model.aibrix.ai: llama2-70b
  additionalConfig: # could be model artifact etc.
  ...
```
### Summary This aims to expose a batch API to users so that they can submit a batch job and retrieve the job's status and results anytime after submission. However, current...
### 🚀 Feature Description and Motivation Currently, existing large language model (LLM) serving engines that execute multi-turn conversations are inefficient as they need to repeatedly compute the key-value (KV) caches...
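The multi-turn inefficiency described above can be illustrated with a minimal sketch of prefix-based KV-cache reuse: cached entries are keyed by the token prefix, so a follow-up turn can pick up the longest previously computed prefix instead of recomputing it. The class and names below are illustrative assumptions, not the aibrix or vLLM API.

```python
import hashlib

class PrefixKVCache:
    """Illustrative map from a token-prefix fingerprint to an opaque KV-cache handle."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tokens):
        # Fingerprint the token sequence; a real engine would use block-level hashing.
        return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def put(self, tokens, kv_handle):
        self._store[self._key(tokens)] = kv_handle

    def get(self, tokens):
        # Longest cached prefix wins: walk back from the full conversation history.
        for end in range(len(tokens), 0, -1):
            handle = self._store.get(self._key(tokens[:end]))
            if handle is not None:
                return tokens[:end], handle
        return [], None
```

With this scheme, turn N+1 only needs to prefill the tokens beyond the longest cached prefix, which is exactly the recomputation the issue wants to avoid.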
### 🚀 Feature Description and Motivation

```yaml
metricsSources:
  - endpoint: gpu-optimizer.aibrix-system.svc.cluster.local:8080
    path: /metrics/aibrix-system/simulator-llama2-7b-a100
    metric: "vllm:deployment_replicas"
    targetValue: "1"
```

In the heterogeneous story, `gpu_optimizer` exposes an endpoint `/metrics/${namespace}/${scale_target_name}`. There seem to be some issues here, ...
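To make the endpoint contract concrete, here is a small sketch of how a consumer might build the per-target metrics path and read a Prometheus-style line such as `vllm:deployment_replicas 1` from the response body. The helper names are assumptions for illustration, not aibrix's actual client code.

```python
def metric_path(namespace: str, scale_target_name: str) -> str:
    # Mirrors the /metrics/${namespace}/${scale_target_name} pattern from gpu_optimizer.
    return f"/metrics/{namespace}/{scale_target_name}"

def parse_metric(body: str, name: str):
    # Parse a Prometheus-style exposition line, e.g. "vllm:deployment_replicas 1".
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(name):
            return float(line.split()[-1])
    return None  # metric not present in the response
```

For the config above, `metric_path("aibrix-system", "simulator-llama2-7b-a100")` yields the `path` field shown in `metricsSources`.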
### 🚀 Feature Description and Motivation Follow-up to https://github.com/aibrix/aibrix/issues/600. There's a potential improvement: the scheduler should pick up the new pod rather than the old one; otherwise it will experience...
### 🐛 Describe the bug KPA never scales down after scaling up. Scaling up works, but scaling down never happens even when there is zero load; basically, gpu_cache_usage_perc is 0....
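For context on what correct behavior would look like, below is a minimal sketch of KPA-style replica math (in the spirit of Knative's pod autoscaler, with illustrative parameter names, not aibrix's actual config): when the observed metric drops to zero, the desired count should step down, bounded by a maximum scale-down rate.

```python
import math

def desired_replicas(observed_metric: float, target_per_pod: float,
                     current: int, max_scale_down_rate: float = 2.0) -> int:
    """Desired replica count, bounded below by the scale-down rate limit.

    Note: this sketch never reaches zero from one replica; scale-to-zero
    typically needs a separate grace-period path.
    """
    want = math.ceil(observed_metric / target_per_pod) if observed_metric > 0 else 0
    floor = math.ceil(current / max_scale_down_rate)  # can't halve faster than the rate
    return max(want, floor)
```

With `gpu_cache_usage_perc` at 0, successive evaluations should shrink the deployment (e.g. 4 → 2 → 1), so a count that stays pinned at its scaled-up value points at the scale-down path never firing.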