server/deploy/oci: running "helm install example ." to deploy the Inference Server leaves the pod not Running because the liveness and readiness probes fail

Open aviv12825 opened this issue 1 year ago • 1 comments

In server/deploy/oci, I ran "helm install example ." to deploy the Inference Server. The pod never reaches the Running state because both the liveness probe and the readiness probe fail.

The describe output is below. I tried adding initialDelaySeconds: 180 to the templates\deployment.yaml file, which didn't help. Can someone please advise?

Events:
Type     Reason     Age                  From               Message
Normal   Scheduled  4m11s                default-scheduler  Successfully assigned default/example-triton-inference-server-9c5d9f79-74rt4 to 10.0.10.95
Warning  Unhealthy  41s (x3 over 61s)    kubelet            Liveness probe failed: Get "http://10.0.10.177:8000/v2/health/live": dial tcp 10.0.10.177:8000: connect: connection refused
Normal   Killing    41s                  kubelet            Container triton-inference-server failed liveness probe, will be restarted
Normal   Pulled     11s (x2 over 4m10s)  kubelet            Container image "nvcr.io/nvidia/tritonserver:24.03-py3" already present on machine
Warning  Unhealthy  11s (x13 over 66s)   kubelet            Readiness probe failed: Get "http://10.0.10.177:8000/v2/health/ready": dial tcp 10.0.10.177:8000: connect: connection refused
Normal   Created    10s (x2 over 4m10s)  kubelet            Created container triton-inference-server
Normal   Started    10s (x2 over 4m10s)  kubelet            Started container triton-inference-server
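
One thing worth ruling out is whether the initialDelaySeconds edit actually reached the cluster, since a change to a chart template only takes effect after the release is upgraded or reinstalled (for example with "helm upgrade example ." run from the chart directory). The sketch below prints the probe settings the live Deployment is using. The Deployment name is an assumption inferred from the pod name in the events, and show_probes is a hypothetical helper name, not part of the chart.

```shell
# Assumed Deployment name, inferred from the pod name in the events above.
DEPLOY="example-triton-inference-server"

# Print the liveness and readiness probe settings of the first container
# in the given Deployment, one JSON object per line.
show_probes() {
  kubectl get deployment "$1" \
    -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'
}
```

Calling show_probes "$DEPLOY" should show initialDelaySeconds with the edited value if the change was applied. If the old value appears, the running Deployment was not updated from the modified template.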

aviv12825 avatar Apr 24 '24 19:04 aviv12825

Hi @aviv12825,

I see the errors returned involve "connection refused". Have you confirmed from the pod logs that the server itself started up successfully to expose these endpoints?
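
A minimal sketch of how the pod logs could be collected, assuming kubectl is pointed at the same cluster and namespace. The pod name is copied from the events in this issue and will differ after a redeploy, and the function names are hypothetical helpers. Because the container is being restarted by the liveness probe, the logs of the previous container instance are usually the ones that show why the server never started listening on port 8000.

```shell
# Pod name taken from the events in this issue; replace it with your own.
POD="example-triton-inference-server-9c5d9f79-74rt4"

# Print logs from the current container, then from the previous
# (restarted) container instance.
triton_pod_logs() {
  kubectl logs "$1"
  kubectl logs "$1" --previous
}

# Show container state, restart count, and the last termination reason.
triton_pod_describe() {
  kubectl describe pod "$1"
}
```

Run triton_pod_logs "$POD" and triton_pod_describe "$POD", then look for startup errors that appear before the probes begin to fail.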

rmccorm4 avatar May 01 '24 00:05 rmccorm4

Closing due to lack of activity. Please re-open the issue if you would like to follow up.

krishung5 avatar Aug 26 '24 22:08 krishung5