aibrix
Cost-efficient and pluggable Infrastructure components for GenAI inference
### 🚀 Feature Description and Motivation Delay scheduling requests to avoid over-assigning them to some inference engines. We have already discussed push- vs. pull-based solutions. This would...
### 🚀 Feature Description and Motivation RAG and Agent patterns are multi-threaded programs; this application-level information should be exposed to the underlying system to enable better colocation, etc. ###...
### 🚀 Feature Description and Motivation Currently, we are leveraging the Vineyard Operator to orchestrate workloads. While it provides a foundation, we've extended the upstream operator with advanced scheduling features...
### 🚀 Feature Description and Motivation

```yaml
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: text2sql-lora-1
  namespace: default
spec:
  baseModel: llama2-70b
  podSelector:
    matchLabels:
      model.aibrix.ai: llama2-70b
  additionalConfig: # could be model artifact etc.
  ...
```
### Summary This aims to expose a batch API to users so that they can submit a batch job and retrieve the job's status and results anytime after submission. However, current...
### 🚀 Feature Description and Motivation Currently, existing large language model (LLM) serving engines that execute multi-turn conversations are inefficient as they need to repeatedly compute the key-value (KV) caches...
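The multi-turn inefficiency described above can be illustrated with a minimal sketch of prefix-based KV-cache reuse: cached entries are keyed by the token prefix, so a follow-up turn can pick up the longest previously computed prefix instead of recomputing it. The class and names below are illustrative assumptions, not the aibrix or vLLM API.

```python
import hashlib

class PrefixKVCache:
    """Illustrative map from a token-prefix fingerprint to an opaque KV-cache handle."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tokens):
        # Fingerprint the token sequence; a real engine would use block-level hashing.
        return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def put(self, tokens, kv_handle):
        self._store[self._key(tokens)] = kv_handle

    def get(self, tokens):
        # Longest cached prefix wins: walk back from the full conversation history.
        for end in range(len(tokens), 0, -1):
            handle = self._store.get(self._key(tokens[:end]))
            if handle is not None:
                return tokens[:end], handle
        return [], None
```

With this scheme, turn N+1 only needs to prefill the tokens beyond the longest cached prefix, which is exactly the recomputation the issue wants to avoid.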
### 🚀 Feature Description and Motivation

```yaml
metricsSources:
  - endpoint: gpu-optimizer.aibrix-system.svc.cluster.local:8080
    path: /metrics/aibrix-system/simulator-llama2-7b-a100
    metric: "vllm:deployment_replicas"
    targetValue: "1"
```

In the heterogeneous story, `gpu_optimizer` exposes an endpoint `/metrics/${namespace}/${scale_target_name}`. There seem to be some issues here, ...
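To make the endpoint contract concrete, here is a small sketch of how a consumer might build the per-target metrics path and read a Prometheus-style line such as `vllm:deployment_replicas 1` from the response body. The helper names are assumptions for illustration, not aibrix's actual client code.

```python
def metric_path(namespace: str, scale_target_name: str) -> str:
    # Mirrors the /metrics/${namespace}/${scale_target_name} pattern from gpu_optimizer.
    return f"/metrics/{namespace}/{scale_target_name}"

def parse_metric(body: str, name: str):
    # Parse a Prometheus-style exposition line, e.g. "vllm:deployment_replicas 1".
    for line in body.splitlines():
        line = line.strip()
        if line.startswith(name):
            return float(line.split()[-1])
    return None  # metric not present in the response
```

For the config above, `metric_path("aibrix-system", "simulator-llama2-7b-a100")` yields the `path` field shown in `metricsSources`.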
### 🚀 Feature Description and Motivation Follow-up to https://github.com/aibrix/aibrix/issues/600. There's a potential improvement: the scheduler should pick up the new pod rather than the old one; otherwise it will experience...
### 🐛 Describe the bug KPA never scales down after scaling up. Scaling up works, but scaling down never happens even when there is zero load; basically, gpu_cache_usage_perc is 0....
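For context on what correct behavior would look like, below is a minimal sketch of KPA-style replica math (in the spirit of Knative's pod autoscaler, with illustrative parameter names, not aibrix's actual config): when the observed metric drops to zero, the desired count should step down, bounded by a maximum scale-down rate.

```python
import math

def desired_replicas(observed_metric: float, target_per_pod: float,
                     current: int, max_scale_down_rate: float = 2.0) -> int:
    """Desired replica count, bounded below by the scale-down rate limit.

    Note: this sketch never reaches zero from one replica; scale-to-zero
    typically needs a separate grace-period path.
    """
    want = math.ceil(observed_metric / target_per_pod) if observed_metric > 0 else 0
    floor = math.ceil(current / max_scale_down_rate)  # can't halve faster than the rate
    return max(want, floor)
```

With `gpu_cache_usage_perc` at 0, successive evaluations should shrink the deployment (e.g. 4 → 2 → 1), so a count that stays pinned at its scaled-up value points at the scale-down path never firing.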