Gateway returns an unhelpful response when the pod is Running but the container is not ready
🐛 Describe the bug
We made a few changes in recent weeks to make sure error responses are explainable, but I still hit an unexpected case today.
Pod Running - {"error":{"code":500,"message":"invalid character 'u' looking for beginning of value"}}
Checking the pod status:
NAME READY STATUS RESTARTS AGE
deepseek-r1-671b-858b4b9569-4w46n-head-2r8p8 0/1 Running
Checking the gateway logs:
E0303 00:05:28.686093 1 gateway.go:502] "error to unmarshal response" err="invalid character 'u' looking for beginning of value" requestID="89b2bf28-e03d-4211-b007-5a1b9eebc8de" responseBody="upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused"
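For context, the message in the 500 response is the error Go's `encoding/json` produces when the body it is asked to decode starts with the letter `u`, as the plain-text "upstream connect error ..." body does. The following is a minimal sketch of that behaviour, assuming the gateway passes the upstream body straight to `json.Unmarshal`; `parseUpstream` is a hypothetical helper, not the gateway's actual function.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseUpstream is a hypothetical helper, not the gateway's actual code.
// It tries to decode an upstream response body as a JSON object.
func parseUpstream(body []byte) (map[string]any, error) {
	var out map[string]any
	if err := json.Unmarshal(body, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// Plain-text body of the kind shown in the gateway log above.
	body := []byte("upstream connect error or disconnect/reset before headers")
	_, err := parseUpstream(body)
	fmt.Println(err) // invalid character 'u' looking for beginning of value
}
```

A gateway that checks the upstream status code or content type before decoding could surface the upstream text itself instead of the JSON parser's complaint.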
The root problem is that we only consider the Pod status and do not check whether the container is ready. In that case, the server is not yet ready to serve requests, but the router still routes the request to the pod, which results in a failure.
We should fix this issue and add a detailed page on the state machine and the result codes a user may receive.
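A routing check that covers both conditions might look like the sketch below. It is only an illustration: `Pod`, `PodCondition`, and `isRoutable` are simplified, hypothetical stand-ins (the real Kubernetes types live in `k8s.io/api/core/v1`), and this is not the gateway's actual implementation.

```go
package main

import "fmt"

// PodCondition and Pod are simplified stand-ins for the Kubernetes pod
// status fields. They are illustrative only.
type PodCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False", or "Unknown"
}

type Pod struct {
	Name       string
	Phase      string // e.g. "Pending", "Running"
	Conditions []PodCondition
}

// isRoutable reports whether a pod should receive traffic: it must be in
// the Running phase and its Ready condition must be True.
func isRoutable(p Pod) bool {
	if p.Phase != "Running" {
		return false
	}
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false // no Ready condition reported yet
}

func main() {
	// Running phase, but the readiness probe has not passed (READY 0/1).
	notReady := Pod{
		Name:       "head",
		Phase:      "Running",
		Conditions: []PodCondition{{Type: "Ready", Status: "False"}},
	}
	fmt.Println(isRoutable(notReady)) // false
}
```

Checking the phase alone would return true for this pod, which matches the failure described in this issue.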
Pod Terminating - {"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}
Pod does not exist - {"error":{"code":400,"message":"model deepseek-r1-671b does not exist"}}
ContainerCreating - {"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}
Steps to Reproduce
Make sure the pod is Running, but use a readiness probe to keep the container from becoming ready.
Expected behavior
In this case, the router should not forward requests to the pod.
Environment
0.2.0
Had a short sync-up with @varungup90. To clarify, I mean the pod is Running but not ready. A pod only becomes ready after all of its containers are ready.
I think the problem is probably that the environment is missing the change in https://github.com/vllm-project/aibrix/pull/776, so the gateway forwards requests to the worker pod. Note that the worker uses a different probe from the head.
After applying the change, it works fine:
{"error":{"code":503,"message":"error on getting pods for model deepseek-r1-671b"}}
We should improve the logs, for example by stating that the pod may not be ready.
Per offline discussion, I will close this task and the PR. The follow-up task is to refactor the gateway code so that each check is itemized separately and returns a precise error message rather than a generic one.
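One possible shape for that refactor is sketched below. Everything in it is hypothetical: `podInfo`, `routeError`, `checkRoutable`, and the message texts are made up for illustration, and only the 400 and 503 codes and the "does not exist" message follow the responses quoted in this issue.

```go
package main

import (
	"fmt"
	"net/http"
)

// podInfo is a hypothetical, simplified view of a pod's state.
type podInfo struct {
	Name        string
	Running     bool
	Ready       bool
	Terminating bool
}

// routeError carries an HTTP status code and a specific message.
type routeError struct {
	Code    int
	Message string
}

func (e *routeError) Error() string {
	return fmt.Sprintf("%d: %s", e.Code, e.Message)
}

// checkRoutable runs each check separately and returns nil when at least
// one pod can serve traffic, or a precise error otherwise.
func checkRoutable(model string, exists bool, pods []podInfo) *routeError {
	if !exists {
		return &routeError{http.StatusBadRequest, "model " + model + " does not exist"}
	}
	if len(pods) == 0 {
		return &routeError{http.StatusServiceUnavailable, "no pods found for model " + model}
	}
	terminating, notReady := 0, 0
	for _, p := range pods {
		switch {
		case p.Terminating:
			terminating++
		case p.Running && p.Ready:
			return nil // at least one pod can take the request
		case p.Running:
			notReady++
		}
	}
	switch {
	case terminating == len(pods):
		return &routeError{http.StatusServiceUnavailable, "all pods for model " + model + " are terminating"}
	case notReady > 0:
		return &routeError{http.StatusServiceUnavailable, "pods for model " + model + " are running but not ready"}
	default:
		return &routeError{http.StatusServiceUnavailable, "pods for model " + model + " are not running yet"}
	}
}

func main() {
	pods := []podInfo{{Name: "head", Running: true, Ready: false}}
	fmt.Println(checkRoutable("deepseek-r1-671b", true, pods))
}
```

Returning a concrete `*routeError` keeps the status code and the message together, so the HTTP handler and the log line can report the same specific reason.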