aibrix
Add probe usage practice for super large models, including multi-node case
🚀 Feature Description and Motivation
When we deploy the DeepSeek 671B model across multiple nodes, startup takes a very long time. This raises a few problems:
- It's better to use `startupProbe`, `livenessProbe`, and `readinessProbe` so that the interval of each can be controlled separately.
- Ray cluster probes can be managed and injected by the Ray cluster controller. This helps the controller manage the RayCluster in a fault-tolerant way. However, we care more about the status of the application itself, vLLM.
We need to establish some practices for this: either make the two mechanisms work together, or rely on the application-level probes alone.
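
As a starting point for discussion, here is a minimal sketch of how the three probes could be sized separately. It is not an AIBrix API: `build_probes` is a hypothetical helper, and the `/health` path, port `8000`, the one-hour startup budget, and the liveness/readiness intervals are assumptions to adjust per deployment. It relies on standard Kubernetes behavior: liveness and readiness checks are held off until the startup probe succeeds, and the startup budget is roughly `failureThreshold * periodSeconds`.

```python
import json
import math


def build_probes(max_startup_seconds, period_seconds=10, port=8000, path="/health"):
    """Return startup, liveness, and readiness probe specs as plain dicts.

    Hypothetical helper for illustration only. The endpoint, port, and the
    liveness/readiness intervals below are assumed values, not project defaults.
    """

    def http_get():
        # Build a fresh dict per probe so the three specs do not share state.
        return {"path": path, "port": port}

    return {
        # Long budget: the container gets failureThreshold * periodSeconds
        # to finish loading weights before it is restarted.
        "startupProbe": {
            "httpGet": http_get(),
            "periodSeconds": period_seconds,
            "failureThreshold": math.ceil(max_startup_seconds / period_seconds),
        },
        # Slow, tolerant check that only starts once the startup probe passes.
        "livenessProbe": {
            "httpGet": http_get(),
            "periodSeconds": 30,
            "failureThreshold": 3,
        },
        # Fast check so a stalled server is pulled from traffic quickly.
        "readinessProbe": {
            "httpGet": http_get(),
            "periodSeconds": 5,
            "failureThreshold": 1,
        },
    }


if __name__ == "__main__":
    # Assumed one-hour startup budget for a very large multi-node model.
    print(json.dumps(build_probes(max_startup_seconds=3600), indent=2))
```

The resulting dicts mirror the probe fields of a container spec, so they can be dumped into a pod template for the vLLM container. Whether the probes injected by the Ray cluster controller should be kept alongside them is the open question of this issue.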
Use Case
fault tolerance and high availability
Proposed Solution
No response