Examples should come with health and readiness checks
🚀 Feature Description and Motivation
Currently, the pod becomes ready immediately even though the application's loading time is still long; during this period, requests to the model server will fail. We used to have such settings, but we recently removed them for simplicity.
Use Case
For stable deployments.
Proposed Solution
No response
Please focus on the samples folder.
I'm willing to take this up.
Based on the samples, here's my understanding of the solution for your requirement:
- Problem: Pod becomes ready immediately while the application/model is still loading, causing failed requests.
- Proposed Solution: Implement health and readiness probes with appropriate delays:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 5
```
- Key Settings:
- 120 seconds initial delay to account for model loading time
- Same /health endpoint for both probes
- Different failure thresholds (3 for liveness, 5 for readiness)
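For context, here is a hedged sketch of how these probes might sit inside a sample Deployment's container spec. The container name and image tag below are illustrative assumptions for the sketch, not values taken from the samples:

```yaml
# Illustrative container spec showing where the probes attach.
# Container name and image tag are assumptions, not from the samples.
containers:
  - name: vllm-openai
    image: vllm/vllm-openai:latest
    ports:
      - containerPort: 8000
    livenessProbe:
      httpGet:
        path: /health   # vLLM's OpenAI-compatible server exposes /health
        port: 8000
      initialDelaySeconds: 120
      periodSeconds: 5
      timeoutSeconds: 1
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 120
      periodSeconds: 5
      timeoutSeconds: 1
      failureThreshold: 5
```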
Is my understanding correct that:
- Your main issue is premature traffic routing before the model is fully loaded?
- The 120-second initial delay would be sufficient for your model loading time?
- You're using a setup similar to the samples (vLLM or similar serving framework)?
Please let me know if any of these assumptions need adjustment for your specific use case.
The Quickstart Model Sample already includes checks, but they are too tight for the current model download time; 120 seconds is not enough. Going to log an issue and will link it here.
See #772
@vivek-orbi sorry for the late response. Are you still interested in this issue? I think @jolfr added some; we probably need to check the rest of the examples to see whether any of them lack the checks.
> Your main issue is premature traffic routing before the model is fully loaded?
Yes. The pod should become ready only after the application is fully ready.
> The 120-second initial delay would be sufficient for your model loading time?
It depends on each user's environment, so it's hard to say. We discussed using a startupProbe for the startup phase instead of increasing the initial delay on the liveness and readiness probes. Then we can use a generous startupProbe together with smaller numbers for liveness. See https://github.com/vllm-project/aibrix/pull/773.
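A hedged sketch of the startupProbe approach, with illustrative numbers only (they are not settled values from the PR): the startupProbe gets a long window to cover model loading, and Kubernetes holds off the liveness and readiness checks until it succeeds, so those probes no longer need a large `initialDelaySeconds`.

```yaml
# Illustrative values only. The startupProbe allows up to
# periodSeconds * failureThreshold = 10 * 30 = 300 seconds for the
# model to load; liveness/readiness checks start only after it passes.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5
  failureThreshold: 5
```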
> You're using a setup similar to the samples (vLLM or similar serving framework)?
Yes. All the samples in this repo assume vLLM.