aibrix gateway returns an uninformative response when the pod is Running but the container is not ready

🐛 Describe the bug

We made a few changes in recent weeks to make sure responses are explainable, but I still see some unexpected cases today.

Pod Running - {"error":{"code":500,"message":"invalid character 'u' looking for beginning of value"}}

checking the status

READY   STATUS              RESTARTS   AGE
deepseek-r1-671b-858b4b9569-4w46n-head-2r8p8                  0/1     Running

checking gateway logs

E0303 00:05:28.686093       1 gateway.go:502] "error to unmarshal response" err="invalid character 'u' looking for beginning of value" requestID="89b2bf28-e03d-4211-b007-5a1b9eebc8de" responseBody="upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused"
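The log shows Envoy's plain-text error body being handed to a JSON decoder, which is where the `invalid character 'u'` message comes from. A minimal sketch of guarding that step follows. `parseUpstreamBody` and `gatewayError` are hypothetical names for illustration, not the actual code in `gateway.go`, and the 503 code is an assumption chosen to match the other responses in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gatewayError mirrors the error envelope shown in this issue.
// The type name is hypothetical.
type gatewayError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// parseUpstreamBody is a hypothetical helper: it checks that the upstream
// body is JSON before decoding, so a plain-text proxy error is surfaced
// as-is instead of as a decoder error.
func parseUpstreamBody(body []byte) (map[string]any, *gatewayError) {
	if !json.Valid(body) {
		return nil, &gatewayError{
			Code:    503, // assumption: treat as "backend unavailable"
			Message: "upstream returned a non-JSON response: " + string(body),
		}
	}
	var parsed map[string]any
	if err := json.Unmarshal(body, &parsed); err != nil {
		// Valid JSON, but not an object (for example an array).
		return nil, &gatewayError{
			Code:    503,
			Message: "unexpected upstream response shape: " + err.Error(),
		}
	}
	return parsed, nil
}

func main() {
	body := []byte("upstream connect error or disconnect/reset before headers")

	// Decoding the body directly reproduces the error from the gateway log.
	var v map[string]any
	fmt.Println(json.Unmarshal(body, &v))

	// With the guard, the caller gets the upstream text instead.
	_, gwErr := parseUpstreamBody(body)
	fmt.Println(gwErr.Code, gwErr.Message)
}
```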

The root problem is that we only consider the pod status and do not check whether the container is ready. In that case the server is not yet ready to serve requests, but the router still routes the request to the pod, which results in a failure.
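A minimal sketch of the readiness check described above. The `Pod` and `PodCondition` types here are simplified local stand-ins for the Kubernetes `core/v1` types so the example runs without external modules, and `isRoutable` / `filterRoutable` are hypothetical names, not functions from the aibrix codebase.

```go
package main

import "fmt"

// PodCondition is a simplified stand-in for the core/v1 pod condition.
type PodCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False", or "Unknown"
}

// Pod is a simplified stand-in for the core/v1 pod.
type Pod struct {
	Name        string
	Phase       string // e.g. "Pending", "Running"
	Terminating bool   // stand-in for a non-nil deletion timestamp
	Conditions  []PodCondition
}

// isRoutable reports whether a pod should receive traffic: it must be
// Running, not terminating, and have its Ready condition set to "True".
// Kubernetes only sets Ready to "True" once all containers are ready.
func isRoutable(p Pod) bool {
	if p.Phase != "Running" || p.Terminating {
		return false
	}
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false // no Ready condition reported yet
}

// filterRoutable keeps only the pods that pass isRoutable.
func filterRoutable(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if isRoutable(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "head-a", Phase: "Running", Conditions: []PodCondition{{Type: "Ready", Status: "False"}}},
		{Name: "head-b", Phase: "Running", Conditions: []PodCondition{{Type: "Ready", Status: "True"}}},
	}
	for _, p := range filterRoutable(pods) {
		fmt.Println("routable:", p.Name)
	}
}
```

With this filter, a pod shown as `0/1 Running` is skipped because its Ready condition is not yet "True".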

We should fix this issue and add a detailed page on the state machine and the result codes a user may receive.

pod terminating - {"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}

pod not exist - {"error":{"code":400,"message":"model deepseek-r1-671b does not exist"}}

ContainerCreating - {"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}

Steps to Reproduce

Make sure the pod is Running, but use a probe to keep the container from becoming ready.

Expected behavior

In such a case, the router should not forward the request to the pod.

Environment

0.2.0

Mar 03 '25 00:03 Jeffwan

Had a short sync-up with @varungup90. To clarify, I mean the pod is Running but not ready. A pod only becomes ready after all of its containers are ready.

Mar 03 '25 19:03 Jeffwan

I think the problem is probably that the environment was missing the change in https://github.com/vllm-project/aibrix/pull/776, so the gateway forwarded the request to a worker pod. Note that the worker uses a different probe from the head.

After applying the change, it works fine:

{"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}

We should improve the logs, for example to indicate that the pod may not be ready.

Mar 04 '25 03:03 Jeffwan

Per offline discussion, I will close this task and PR. The follow-up task is to refactor the gateway code so that each check is itemized separately and returns a precise error message rather than a generic one.
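One possible shape for that refactor, as a rough sketch only. `podState`, `routeError`, and `checkPods` are hypothetical names, and the 503 message wording is invented for illustration; only the 400 "does not exist" message is taken from the responses quoted in this issue.

```go
package main

import "fmt"

// podState is a hypothetical, simplified view of a pod for routing checks.
type podState struct {
	Phase       string // e.g. "Pending", "Running"
	Ready       bool   // true once all containers are ready
	Terminating bool
}

// routeError is a hypothetical error type carrying the result code.
type routeError struct {
	Code    int
	Message string
}

func (e *routeError) Error() string {
	return fmt.Sprintf("%d: %s", e.Code, e.Message)
}

// checkPods runs each check separately and names the one that failed.
// It returns nil when at least one pod can receive traffic.
func checkPods(model string, modelExists bool, pods []podState) *routeError {
	if !modelExists {
		return &routeError{Code: 400, Message: fmt.Sprintf("model %s does not exist", model)}
	}
	if len(pods) == 0 {
		return &routeError{Code: 503, Message: fmt.Sprintf("no pods found for model %s", model)}
	}
	var terminating, notRunning, notReady int
	for _, p := range pods {
		switch {
		case p.Terminating:
			terminating++
		case p.Phase != "Running":
			notRunning++
		case !p.Ready:
			notReady++
		default:
			return nil // at least one routable pod
		}
	}
	return &routeError{
		Code: 503,
		Message: fmt.Sprintf(
			"no ready pods for model %s: %d terminating, %d not running, %d running but not ready",
			model, terminating, notRunning, notReady),
	}
}

func main() {
	pods := []podState{{Phase: "Running", Ready: false}}
	if err := checkPods("deepseek-r1-671b", true, pods); err != nil {
		fmt.Println(err)
	}
}
```

Counting the pods per failing check keeps a single 503 response while still telling the user which state the deployment is in.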

Mar 04 '25 19:03 varungup90