Pod status in kubectl get pod is wrong
🐛 Describe the bug
This is the screenshot of kubectl get pod. You can see the STATUS is 'Running' even though the pod is not actually ready (see the READY column). This can mess up many things, for example sending requests to a not-ready pod, or worse.
Suspicion: I think aibrix has odd readiness-check logic; maybe it only considers the init container rather than all containers.
Even when only 1/2 containers are ready, the pod status is still Running.
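To show what I mean, here is a minimal sketch (using the official kubernetes Python client; the pod name is the one from the describe output below) that prints the pod phase next to the pod-level Ready condition and the per-container readiness. On this pod the phase is Running while the Ready condition is False:

```python
# Sketch: compare pod phase with readiness via the kubernetes Python client.
# Assumes a working kubeconfig; the pod name is taken from the describe output below.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(
    name="aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx",
    namespace="default",
)

# Pod phase: becomes "Running" as soon as at least one container is running.
print("phase:", pod.status.phase)

# Pod-level Ready condition: only True once every container's readiness probe passes.
ready = next(c for c in pod.status.conditions if c.type == "Ready")
print("Ready condition:", ready.status)

# Per-container readiness (this is what the READY column counts).
for cs in pod.status.container_statuses:
    print(f"container {cs.name}: ready={cs.ready}")
```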
k describe pod aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx
Name: aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx
Namespace: default
Priority: 0
Service Account: default
Node: 10.0.0.44/10.0.0.44
Start Time: Fri, 24 Jan 2025 14:22:41 -0800
Labels: model.aibrix.ai/name=deepseek-llm-7b-chat
pod-template-hash=6d879fd5
Annotations: kubectl.kubernetes.io/restartedAt: 2025-01-24T14:17:51-08:00
prometheus.io/path: /metrics
prometheus.io/port: 8000
prometheus.io/scrape: true
vke.volcengine.com/cello-pod-evict-policy: allow
Status: Running
IP: 10.0.0.59
IPs:
IP: 10.0.0.59
Controlled By: ReplicaSet/aibrix-model-deepseek-llm-7b-chat-6d879fd5
Init Containers:
init-model:
Container ID: containerd://711107184fd8812c8cf4c0e6a3b841c7b640701cd640c8da59fbf7ec40d7ac6b
Image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.1
Image ID: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime@sha256:e89f2714affcb7ca4204f9a11c6e9e2d99edeb003c550a68ed0f70e80865160c
Port: <none>
Host Port: <none>
Command:
aibrix_download
--model-uri
tos://aibrix-artifact-testing/models/deepseek-llm-7b-chat/
--local-dir
/models/
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 24 Jan 2025 14:22:43 -0800
Finished: Fri, 24 Jan 2025 14:22:43 -0800
Ready: True
Restart Count: 0
Environment:
DOWNLOADER_MODEL_NAME: deepseek-llm-7b-chat
DOWNLOADER_NUM_THREADS: 16
DOWNLOADER_ALLOW_FILE_SUFFIX: json, safetensors, bin
TOS_ACCESS_KEY: <set to the key 'TOS_ACCESS_KEY' in secret 'tos-credential'> Optional: false
TOS_SECRET_KEY: <set to the key 'TOS_SECRET_KEY' in secret 'tos-credential'> Optional: false
TOS_ENDPOINT: tos-cn-beijing.ivolces.com
TOS_REGION: cn-beijing
Mounts:
/models from model-hostpath (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbmbm (ro)
Containers:
vllm-openai:
Container ID: containerd://f6d4bd7aa7cd02bac1a04c34d484484937fd3ef3c0d871e892c4118f42246992
Image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.2-distributed
Image ID: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai@sha256:9b0a651d62047fc40d94c5c210e266dc6b1f08446366e92e85a44e5dad79c805
Port: 8000/TCP
Host Port: 0/TCP
Command:
python3
-m
vllm.entrypoints.openai.api_server
--host
0.0.0.0
--port
8000
--model
/models/deepseek-llm-7b-chat
--served-model-name
deepseek-llm-7b-chat
--trust-remote-code
--api-key
sk-kFJ12nKsFVfVmGpj3QzX65s4RbN2xJqWzPYCjYu7wT3BlbLi
--dtype
half
State: Running
Started: Fri, 24 Jan 2025 14:22:43 -0800
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
vke.volcengine.com/eni-ip: 1
Requests:
nvidia.com/gpu: 1
vke.volcengine.com/eni-ip: 1
Liveness: http-get http://:8000/health delay=90s timeout=1s period=5s #success=1 #failure=3
Readiness: http-get http://:8000/health delay=90s timeout=1s period=5s #success=1 #failure=3
Environment: <none>
Mounts:
/dev/shm from dshm (rw)
/models from model-hostpath (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbmbm (ro)
aibrix-runtime:
Container ID: containerd://a325d1064a424d70538c826927e7cd8f51edb8c3693bbc1f0a456b75072b445b
Image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.1
Image ID: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime@sha256:0ac5b7ef285ea894c0b5bc9c1913dd2f7622301fe2a84a78e250a67f9a948fd8
Port: 8080/TCP
Host Port: 0/TCP
Command:
aibrix_runtime
--port
8080
State: Running
Started: Fri, 24 Jan 2025 14:22:43 -0800
Ready: True
Restart Count: 0
Liveness: http-get http://:8080/healthz delay=3s timeout=1s period=2s #success=1 #failure=3
Readiness: http-get http://:8080/ready delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
INFERENCE_ENGINE: vllm
INFERENCE_ENGINE_ENDPOINT: http://localhost:8000
Mounts:
/models from model-hostpath (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbmbm (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
model-hostpath:
Type: HostPath (bare host directory volume)
Path: /root/models
HostPathType: DirectoryOrCreate
dshm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 4Gi
kube-api-access-lbmbm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 64s default-scheduler Successfully assigned default/aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx to 10.0.0.44
Normal Pulled 64s kubelet Container image "aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.1" already present on machine
Normal Created 63s kubelet Created container init-model
Normal Started 63s kubelet Started container init-model
Normal Pulled 63s kubelet Container image "aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.2-distributed" already present on machine
Normal Created 63s kubelet Created container vllm-openai
Normal Started 63s kubelet Started container vllm-openai
Normal Pulled 63s kubelet Container image "aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.1" already present on machine
Normal Created 63s kubelet Created container aibrix-runtime
Normal Started 63s kubelet Started container aibrix-runtime
Steps to Reproduce
No response
Expected behavior
No response
Environment
No response
@gangmuk did you check the logs?
BTW, why not update your yaml to use v0.2.0-rc.2 for the runtime? I think you forgot to upgrade the other image.
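In case it helps, a minimal sketch of bumping only the aibrix-runtime sidecar image from Python, assuming the Deployment is named aibrix-model-deepseek-llm-7b-chat (inferred from the ReplicaSet name in the describe output; adjust to your manifest). Editing the YAML and re-applying it achieves the same thing:

```python
# Sketch: bump the aibrix-runtime container image from v0.2.0-rc.1 to v0.2.0-rc.2.
# The Deployment name below is an assumption inferred from the ReplicaSet name.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "aibrix-runtime",
                        "image": "aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.2",
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="aibrix-model-deepseek-llm-7b-chat",  # assumed Deployment name
    namespace="default",
    body=patch,
)
```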
Which logs do you mean? The pod logs? @Jeffwan
I didn't know I was supposed to update that part of the YAML. I was just using what was already running there. I will update it.
Tagging @nwangfw @happyandslow to give a pointer.
Which logs do you mean? The pod logs?
Yes, the pod logs. With those we would know more details: which container is not up, and the reason it is not up.
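For reference, a quick sketch of pulling the logs of each app container in this pod (the container names come from the describe output above; `kubectl logs <pod> -c <container>` does the same thing):

```python
# Sketch: fetch recent logs for each app container to see why vllm-openai
# is not becoming ready. Container names are taken from the describe output above.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod_name = "aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx"
for container in ("vllm-openai", "aibrix-runtime"):
    logs = v1.read_namespaced_pod_log(
        name=pod_name,
        namespace="default",
        container=container,
        tail_lines=100,  # the last 100 lines are usually enough to spot startup errors
    )
    print(f"--- {container} ---")
    print(logs)
```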
I didn't know I was supposed to update that part of the YAML. I was just using what was already running there. I will update it.
This is not a very mature project yet and it evolves fast as well. Don't expect everything to be correct. Feel free to check all the details and figure out problems yourself. :D
Let me know whether you can bring up the pod after bumping the version.