
Add probe usage practice for super large models, including multi-node case

Open Jeffwan opened this issue 10 months ago • 0 comments

🚀 Feature Description and Motivation

When we deploy the DeepSeek 671B model across multiple nodes, startup takes a very long time. This raises a few problems:

  1. It is better to define startupProbe, livenessProbe, and readinessProbe separately so that each interval can be tuned independently.
  2. Ray cluster probes can be managed and injected by the Ray cluster controller, which helps the controller manage the RayCluster in a fault-tolerant way. However, we care more about the status of the application itself, vLLM.
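
As a rough illustration of point 1, the sketch below builds three separate probe specs as plain Python dicts and sizes the startup probe from a startup budget. The field names (`httpGet`, `periodSeconds`, `failureThreshold`) are standard Kubernetes probe fields. Everything else is an assumption for illustration: the `/health` path and port 8000 stand in for the vLLM server endpoint, the one-hour budget is a placeholder rather than a measured load time for the 671B model, and the helper names `build_probes` and `startup_budget_seconds` are hypothetical.

```python
import json
import math


def build_probes(max_startup_seconds, period_seconds=30, path="/health", port=8000):
    """Return startup, liveness, and readiness probe specs as plain dicts.

    path and port are assumed values for the application health endpoint.
    """
    def http_get():
        # Fresh dict per probe so the specs do not share mutable state.
        return {"path": path, "port": port}

    # A startup probe tolerates roughly failureThreshold * periodSeconds of
    # failed checks before the container is restarted, so round up.
    failure_threshold = math.ceil(max_startup_seconds / period_seconds)

    return {
        # Long budget: covers the slow model load. Liveness and readiness
        # checks do not run until this probe has succeeded.
        "startupProbe": {
            "httpGet": http_get(),
            "periodSeconds": period_seconds,
            "failureThreshold": failure_threshold,
        },
        # Illustrative values: restart a hung server after a few misses.
        "livenessProbe": {
            "httpGet": http_get(),
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
        # Illustrative values: stop routing traffic quickly when unhealthy.
        "readinessProbe": {
            "httpGet": http_get(),
            "periodSeconds": 5,
            "failureThreshold": 1,
        },
    }


def startup_budget_seconds(startup_probe):
    """Approximate time the startup probe allows before giving up."""
    return startup_probe["failureThreshold"] * startup_probe["periodSeconds"]


if __name__ == "__main__":
    # 3600 seconds is a placeholder budget, not a measured value.
    probes = build_probes(max_startup_seconds=3600)
    print(json.dumps(probes, indent=2))
```

The resulting dict can be merged into a container spec. The point of the split is that a long startup budget does not force slow liveness or readiness intervals once the model is serving. Whether these probes should sit next to the probes injected by the Ray cluster controller or replace them is the open question this issue raises.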

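We need to establish some best practices here: either make the two mechanisms (the probes injected by the Ray cluster controller and the application-level vLLM probes) work together, or rely on the application-level probes alone.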

Use Case

Fault tolerance and high availability.

Proposed Solution

No response

Jeffwan · Mar 03 '25 00:03