Examples should come with health and readiness checks
🚀 Feature Description and Motivation
Currently, the pod becomes ready immediately even though the application's loading time is still long; during this period, requests to the model server will fail. We used to have such settings, but we recently removed them for simplicity.
Use Case
For stable deployments.
Proposed Solution
No response
Please focus on the samples folder.
I'm willing to take this up.
Based on the samples, here's my understanding of the solution for your requirement:
- Problem: Pod becomes ready immediately while the application/model is still loading, causing failed requests.
- Proposed Solution: Implement health and readiness probes with appropriate delays:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 5
```
- Key Settings:
- 120 seconds initial delay to account for model loading time
- Same /health endpoint for both probes
- Different failure thresholds (3 for liveness, 5 for readiness)
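For context, here is a hedged sketch of how these probes might sit inside a sample Deployment's container spec. The container name and image tag below are illustrative assumptions for the sketch, not values taken from the samples:

```yaml
# Illustrative container spec showing where the probes attach.
# Container name and image tag are assumptions, not from the samples.
containers:
  - name: vllm-openai
    image: vllm/vllm-openai:latest
    ports:
      - containerPort: 8000
    livenessProbe:
      httpGet:
        path: /health   # vLLM's OpenAI-compatible server exposes /health
        port: 8000
      initialDelaySeconds: 120
      periodSeconds: 5
      timeoutSeconds: 1
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 120
      periodSeconds: 5
      timeoutSeconds: 1
      failureThreshold: 5
```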
Is my understanding correct that:
- Your main issue is premature traffic routing before the model is fully loaded?
- The 120-second initial delay would be sufficient for your model loading time?
- You're using a setup similar to the samples (vLLM or similar serving framework)?
Please let me know if any of these assumptions need adjustment for your specific use case.
The Quickstart Model Sample already includes checks, but they are too tight for the current model download time; 120 seconds is not enough. Going to log an issue and will link it here.
See #772
@vivek-orbi sorry for the late response. Are you still interested in this issue? I think @jolfr added some; we probably need to check the rest of the examples to see whether any of them lack the checks.
> Your main issue is premature traffic routing before the model is fully loaded?
Yes. The pod should become ready only after the application is fully ready.
> The 120-second initial delay would be sufficient for your model loading time?
It depends on each user's environment, so it's hard to say. We discussed using a startupProbe for the startup phase instead of increasing the initial delay on the liveness and readiness probes. Then we can use a generous startupProbe together with smaller numbers for liveness. See https://github.com/vllm-project/aibrix/pull/773.
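A hedged sketch of the startupProbe approach, with illustrative numbers only (they are not settled values from the PR): the startupProbe gets a long window to cover model loading, and Kubernetes holds off the liveness and readiness checks until it succeeds, so those probes no longer need a large `initialDelaySeconds`.

```yaml
# Illustrative values only. The startupProbe allows up to
# periodSeconds * failureThreshold = 10 * 30 = 300 seconds for the
# model to load; liveness/readiness checks start only after it passes.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5
  failureThreshold: 5
```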
> You're using a setup similar to the samples (vLLM or similar serving framework)?
Yes. All the samples in this repo assume vLLM.