Pod status in kubectl get pod is wrong
🐛 Describe the bug
This is the screenshot of kubectl get pod. You can see the STATUS is 'Running' even though the pod is not actually ready (see the READY column). This can mess up many things, for example sending requests to a not-ready pod, or worse.
Suspicion: I think aibrix has odd readiness-check logic; maybe it only considers the init container rather than all containers.
Even when only 1/2 containers are ready, the pod status is still Running.
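To show what I mean, here is a minimal sketch (using the official kubernetes Python client; the pod name is the one from the describe output below) that prints the pod phase next to the pod-level Ready condition and the per-container readiness. On this pod the phase is Running while the Ready condition is False:

```python
# Sketch: compare pod phase with readiness via the kubernetes Python client.
# Assumes a working kubeconfig; the pod name is taken from the describe output below.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(
    name="aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx",
    namespace="default",
)

# Pod phase: becomes "Running" as soon as at least one container is running.
print("phase:", pod.status.phase)

# Pod-level Ready condition: only True once every container's readiness probe passes.
ready = next(c for c in pod.status.conditions if c.type == "Ready")
print("Ready condition:", ready.status)

# Per-container readiness (this is what the READY column counts).
for cs in pod.status.container_statuses:
    print(f"container {cs.name}: ready={cs.ready}")
```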
k describe pod aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx
Name: aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx
Namespace: default
Priority: 0
Service Account: default
Node: 10.0.0.44/10.0.0.44
Start Time: Fri, 24 Jan 2025 14:22:41 -0800
Labels: model.aibrix.ai/name=deepseek-llm-7b-chat
pod-template-hash=6d879fd5
Annotations: kubectl.kubernetes.io/restartedAt: 2025-01-24T14:17:51-08:00
prometheus.io/path: /metrics
prometheus.io/port: 8000
prometheus.io/scrape: true
vke.volcengine.com/cello-pod-evict-policy: allow
Status: Running
IP: 10.0.0.59
IPs:
IP: 10.0.0.59
Controlled By: ReplicaSet/aibrix-model-deepseek-llm-7b-chat-6d879fd5
Init Containers:
init-model:
Container ID: containerd://711107184fd8812c8cf4c0e6a3b841c7b640701cd640c8da59fbf7ec40d7ac6b
Image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.1
Image ID: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime@sha256:e89f2714affcb7ca4204f9a11c6e9e2d99edeb003c550a68ed0f70e80865160c
Port: <none>
Host Port: <none>
Command:
aibrix_download
--model-uri
tos://aibrix-artifact-testing/models/deepseek-llm-7b-chat/
--local-dir
/models/
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 24 Jan 2025 14:22:43 -0800
Finished: Fri, 24 Jan 2025 14:22:43 -0800
Ready: True
Restart Count: 0
Environment:
DOWNLOADER_MODEL_NAME: deepseek-llm-7b-chat
DOWNLOADER_NUM_THREADS: 16
DOWNLOADER_ALLOW_FILE_SUFFIX: json, safetensors, bin
TOS_ACCESS_KEY: <set to the key 'TOS_ACCESS_KEY' in secret 'tos-credential'> Optional: false
TOS_SECRET_KEY: <set to the key 'TOS_SECRET_KEY' in secret 'tos-credential'> Optional: false
TOS_ENDPOINT: tos-cn-beijing.ivolces.com
TOS_REGION: cn-beijing
Mounts:
/models from model-hostpath (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbmbm (ro)
Containers:
vllm-openai:
Container ID: containerd://f6d4bd7aa7cd02bac1a04c34d484484937fd3ef3c0d871e892c4118f42246992
Image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.2-distributed
Image ID: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai@sha256:9b0a651d62047fc40d94c5c210e266dc6b1f08446366e92e85a44e5dad79c805
Port: 8000/TCP
Host Port: 0/TCP
Command:
python3
-m
vllm.entrypoints.openai.api_server
--host
0.0.0.0
--port
8000
--model
/models/deepseek-llm-7b-chat
--served-model-name
deepseek-llm-7b-chat
--trust-remote-code
--api-key
sk-kFJ12nKsFVfVmGpj3QzX65s4RbN2xJqWzPYCjYu7wT3BlbLi
--dtype
half
State: Running
Started: Fri, 24 Jan 2025 14:22:43 -0800
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
vke.volcengine.com/eni-ip: 1
Requests:
nvidia.com/gpu: 1
vke.volcengine.com/eni-ip: 1
Liveness: http-get http://:8000/health delay=90s timeout=1s period=5s #success=1 #failure=3
Readiness: http-get http://:8000/health delay=90s timeout=1s period=5s #success=1 #failure=3
Environment: <none>
Mounts:
/dev/shm from dshm (rw)
/models from model-hostpath (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbmbm (ro)
aibrix-runtime:
Container ID: containerd://a325d1064a424d70538c826927e7cd8f51edb8c3693bbc1f0a456b75072b445b
Image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.1
Image ID: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime@sha256:0ac5b7ef285ea894c0b5bc9c1913dd2f7622301fe2a84a78e250a67f9a948fd8
Port: 8080/TCP
Host Port: 0/TCP
Command:
aibrix_runtime
--port
8080
State: Running
Started: Fri, 24 Jan 2025 14:22:43 -0800
Ready: True
Restart Count: 0
Liveness: http-get http://:8080/healthz delay=3s timeout=1s period=2s #success=1 #failure=3
Readiness: http-get http://:8080/ready delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
INFERENCE_ENGINE: vllm
INFERENCE_ENGINE_ENDPOINT: http://localhost:8000
Mounts:
/models from model-hostpath (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbmbm (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
model-hostpath:
Type: HostPath (bare host directory volume)
Path: /root/models
HostPathType: DirectoryOrCreate
dshm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 4Gi
kube-api-access-lbmbm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 64s default-scheduler Successfully assigned default/aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx to 10.0.0.44
Normal Pulled 64s kubelet Container image "aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.1" already present on machine
Normal Created 63s kubelet Created container init-model
Normal Started 63s kubelet Started container init-model
Normal Pulled 63s kubelet Container image "aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.2-distributed" already present on machine
Normal Created 63s kubelet Created container vllm-openai
Normal Started 63s kubelet Started container vllm-openai
Normal Pulled 63s kubelet Container image "aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.1" already present on machine
Normal Created 63s kubelet Created container aibrix-runtime
Normal Started 63s kubelet Started container aibrix-runtime
Steps to Reproduce
No response
Expected behavior
No response
Environment
No response
@gangmuk did you check the logs?
BTW, why not update your yaml to use v0.2.0-rc.2 for the runtime? I think you forgot to upgrade the other image.
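In case it helps, a minimal sketch of bumping only the aibrix-runtime sidecar image from Python, assuming the Deployment is named aibrix-model-deepseek-llm-7b-chat (inferred from the ReplicaSet name in the describe output; adjust to your manifest). Editing the YAML and re-applying it achieves the same thing:

```python
# Sketch: bump the aibrix-runtime container image from v0.2.0-rc.1 to v0.2.0-rc.2.
# The Deployment name below is an assumption inferred from the ReplicaSet name.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "aibrix-runtime",
                        "image": "aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.2",
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="aibrix-model-deepseek-llm-7b-chat",  # assumed Deployment name
    namespace="default",
    body=patch,
)
```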
Which logs do you mean? The pod logs? @Jeffwan
I didn't know I was supposed to update that part of the YAML. I was just using what was already running there. I will update it.
Tagging @nwangfw @happyandslow to give a pointer.
Which logs do you mean? The pod logs?
Yes, the pod logs. With those we would know more details: which container is not up, and the reason it is not up.
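For reference, a quick sketch of pulling the logs of each app container in this pod (the container names come from the describe output above; `kubectl logs <pod> -c <container>` does the same thing):

```python
# Sketch: fetch recent logs for each app container to see why vllm-openai
# is not becoming ready. Container names are taken from the describe output above.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod_name = "aibrix-model-deepseek-llm-7b-chat-6d879fd5-z4rfx"
for container in ("vllm-openai", "aibrix-runtime"):
    logs = v1.read_namespaced_pod_log(
        name=pod_name,
        namespace="default",
        container=container,
        tail_lines=100,  # the last 100 lines are usually enough to spot startup errors
    )
    print(f"--- {container} ---")
    print(logs)
```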
I didn't know I was supposed to update that part of the YAML. I was just using what was already running there. I will update it.
This is not a very mature project yet and it evolves fast as well. Don't expect everything to be correct. Feel free to check all the details and figure out problems yourself. :D
Let me know whether you can bring up the pod after bumping the version.