Jiaxin Shan issues

Results: 271 issues of Jiaxin Shan

Support debug port in gateway plugin

### 🚀 Feature Description and Motivation Service discovery is commonly misconfigured by users. However, we do not have a very straightforward way to help users diagnose the problem. The easiest...

[Discussion] Simplify AIBrix deployment by removing Envoy Gateway

### 🚀 Feature Description and Motivation Today AIBrix recommends/ships Envoy Gateway, but we typically run a `single` gateway instance. Envoy Gateway adds installation and controller complexity that many users don’t...

[batch] Build resource provider layer for batch workloads

### 🚀 Feature Description and Motivation Currently, we need to launch Kubernetes jobs and host GPU models. To support a broader range of customers, we need to consider other resource providers....

area/batch

[metadata] provide better abstraction for redis client

### 🚀 Feature Description and Motivation See the discussion here: https://github.com/vllm-project/aibrix/pull/1639#discussion_r2418152646 Let's build a better abstraction instead of directly referencing a Redis client. At the same time, storage has a Redis implementation...

engine prefix should be provided by user and we should not amend `vllm:`

### 🐛 Describe the bug
```
W0930 18:13:41.031286 1 fetcher.go:99] Failed to fetch metric vllm:gpu_cache_usage_perc from pod default/mock-llama2-7b-7cc98b7f5f-764t4: metric vllm:gpu_cache_usage_perc not found in central registry. Returning zero value.
```
###...

area/autoscaling

refine the user management API

### 🚀 Feature Description and Motivation https://github.com/vllm-project/aibrix/pull/1639#discussion_r2417964859 The current user management API (/CreateUser, /ReadUser, etc., all using POST) is a direct migration from the previous Go service. While functional, it...

HPA reconcile keeps generating new recommendations

### 🐛 Describe the bug 1. When we update the HPA configuration, it enqueues the objects and immediately generates new recommendations. 2. The controller normally updates the CR object...

area/autoscaling

Add Circuit Breaker Policy for HPA on Bad Metrics

### 🚀 Feature Description and Motivation Currently, when PodAutoscaler (HPA/KPA/APA) receives abnormal or invalid metrics (e.g., NaN, outliers, sudden spikes) or encounters unexpected behavior such as a rising error rate, it...

area/autoscaling
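The bad-metric cases named in the circuit-breaker issue above (NaN, outliers, sudden spikes) can be illustrated with a small sketch. This is not AIBrix code: the class name `MetricCircuitBreaker`, the `max_spike_ratio` and `trip_after` thresholds, and the "fall back to the last good value, then freeze" policy are all assumptions made for illustration, since the issue text is truncated before it describes the intended behavior.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricCircuitBreaker:
    """Hypothetical guard for autoscaler metrics (not part of AIBrix)."""

    max_spike_ratio: float = 10.0  # assumed: sample > 10x last good value is a spike
    trip_after: int = 3            # assumed: consecutive bad samples before opening
    last_good: Optional[float] = None
    bad_streak: int = 0

    def is_bad(self, value: float) -> bool:
        # NaN, infinity, and negative readings are never valid inputs.
        if math.isnan(value) or math.isinf(value) or value < 0:
            return True
        # Treat a jump far above the last good sample as a sudden spike.
        if self.last_good is not None and self.last_good > 0:
            return value > self.last_good * self.max_spike_ratio
        return False

    @property
    def is_open(self) -> bool:
        # The breaker is open once enough bad samples arrive in a row.
        return self.bad_streak >= self.trip_after

    def observe(self, value: float) -> Optional[float]:
        """Return the value to scale on, or None if scaling should be frozen."""
        if self.is_bad(value):
            self.bad_streak += 1
            # Below the trip threshold, fall back to the last good sample
            # (which is None if no good sample has been seen yet).
            return None if self.is_open else self.last_good
        # A good sample closes the breaker and becomes the new baseline.
        self.bad_streak = 0
        self.last_good = value
        return value
```

A caller would pass each fetched sample through `observe` and skip the scaling decision whenever it returns `None`. Whether a real policy should freeze, scale conservatively, or alert instead is a design choice the issue leaves open.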

[Feat] Support StormService pause rollout in upgrade

## Pull Request Description
[Feat] Support StormService pause rollout in upgrade
* Update stormservice golang client
* Improve the test coverage
* Refactor the API to support manual resume

##...

Verify custom metrics fetch working and implement external metrics support

### 🚀 Feature Description and Motivation Our autoscaling framework already has multiple MetricFetcher implementations:
- RestMetricsFetcher → direct pod /metrics endpoint
- ResourceMetricsFetcher → Kubernetes resource metrics (cpu, memory)
-...

area/autoscaling