
Python backend process is a zombie, but Tritonserver `v2/health` still returns 200 OK

Open: burling opened this issue 1 year ago • 1 comment

Description: I am using the Triton vLLM backend. I ran into a problem where the Python backend process is a zombie, but Tritonserver `v2/health` still returns 200 OK. Shouldn't the parent process also exit?
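For context, a process shows up as `<defunct>` (state `Z`) when it has exited but its parent has not yet collected its exit status with `wait()`/`waitpid()`. The parent itself keeps running, so anything it serves, including an HTTP health endpoint, keeps answering. The following is a minimal, Linux-only sketch of that mechanism using a throwaway child process; it is not Triton code, and `read_state` and `demo_zombie` are hypothetical helper names.

```python
import os
import time


def read_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("State:"):
                # e.g. "State:\tZ (zombie)" -> "Z"
                return line.split()[1]
    raise ValueError(f"no State line for pid {pid}")


def demo_zombie() -> tuple[str, int]:
    """Fork a child that exits at once; observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running the parent's cleanup handlers.
        os._exit(3)
    # Parent: the exited child stays in state Z until the parent reaps it.
    state = read_state(pid)
    for _ in range(100):
        if state == "Z":
            break
        time.sleep(0.01)
        state = read_state(pid)
    # waitpid collects the exit status, which removes the zombie entry.
    _, status = os.waitpid(pid, 0)
    return state, os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    state, code = demo_zombie()
    print(f"state before reaping: {state}, exit code: {code}")
```

Until the `waitpid` call, `ps` would list the child as `<defunct>` exactly like PID 1059 below, while the parent continues running normally.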

[root@llmserver-f-llmcs4-9 log]# ps -ef|grep triton
root         657     645  0 May11 pts/0    00:03:25 /opt/tritonserver/bin/tritonserver --model-repository=/usr/local/xxx/bin//triton0/model_repo --http-port=8000 --grpc-port=8001 --log-file=/usr/local/xxx/log/
root        1059     657  0 May11 pts/0    00:19:02 [triton_python_b] <defunct>
root      488571  456879  0 19:24 pts/3    00:00:00 grep --color=auto triton

[root@llmserver-f-llmcs4-9 log]# head -n 5 /proc/1059/status
Name:   triton_python_b
State:  Z (zombie)
Tgid:   1059
Ngid:   0
Pid:    1059
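The `State:` line above is what identifies the zombie. A minimal sketch of checking it programmatically, assuming the usual `Key:<whitespace>value` layout of `/proc/<pid>/status`; `parse_status` and `is_zombie` are hypothetical names, and `SAMPLE` is the output shown above. (The name `triton_python_b` is Linux truncating the process name to 15 characters, presumably from `triton_python_backend_stub`.)

```python
def parse_status(text: str) -> dict[str, str]:
    """Parse the 'Key:<whitespace>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_zombie(text: str) -> bool:
    """True if the State field starts with 'Z', as in 'Z (zombie)'."""
    return parse_status(text).get("State", "").startswith("Z")


# Copied from the `head -n 5 /proc/1059/status` output in this issue.
SAMPLE = """Name:   triton_python_b
State:  Z (zombie)
Tgid:   1059
Ngid:   0
Pid:    1059
"""

if __name__ == "__main__":
    print(is_zombie(SAMPLE))
```

On a live system the same check would read `/proc/<pid>/status` for the PID in question instead of using a hard-coded string.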

[root@llmserver-f-llmcs4-9 log]# curl http://127.0.0.1:8002/v2/health -v
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8002 (#0)
> GET /v2/health HTTP/1.1
> Host: 127.0.0.1:8002
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< Content-Length: 0
< 
* Connection #0 to host 127.0.0.1 left intact
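A 200 with an empty body only shows that something is answering on that port. Note that the request above goes to port 8002, which is Triton's default metrics port, while the command line sets `--http-port=8000`. The KServe v2 protocol that Triton implements defines `GET /v2/health/live`, `GET /v2/health/ready`, and per-model `GET /v2/models/<name>/ready`. A minimal probe sketch using only the standard library, assuming the HTTP port is 8000; `probe` and `triton_ready` are hypothetical helpers. Whether these endpoints flip when the stub dies is exactly the open question in this issue, so the probe is only as good as the server's own status reporting.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # Non-2xx answers (e.g. 400 or 503) are raised as HTTPError: not ready.
        return False
    except OSError:
        # Connection refused, DNS failure, or timeout: treat as unhealthy.
        return False


def triton_ready(base: str = "http://127.0.0.1:8000", model: Optional[str] = None) -> bool:
    """Check server liveness and readiness, and optionally one model's readiness."""
    paths = ["/v2/health/live", "/v2/health/ready"]
    if model is not None:
        paths.append(f"/v2/models/{model}/ready")
    return all(probe(base + path) for path in paths)
```

Including a model-level check (for example `triton_ready(model="my_vllm_model")`, with a hypothetical model name) is more likely to reflect a backend-specific failure than the server-wide endpoints alone.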

Triton information: TRITON_SERVER_VERSION="v2.42.0" TRITON_DEPS_VERSION="r24.01"

Are you using the Triton container or did you build it yourself? I built the container myself.

Expected behavior: Triton server should detect that the backend process is a zombie and then either exit the Triton process or report an unhealthy state.
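Until the server reports this itself, the expected behavior could be approximated from outside Triton with a watchdog that treats the server as unhealthy whenever one of its direct child processes is a zombie, regardless of what the HTTP endpoint says. A minimal Linux-only sketch, assuming `/proc/<pid>/stat` is readable; `parse_stat`, `zombie_children`, and the `should_report_unhealthy` policy are hypothetical, and Triton itself does not necessarily work this way.

```python
from pathlib import Path


def parse_stat(line: str) -> tuple[str, str, int]:
    """Split a /proc/<pid>/stat line into (comm, state, ppid).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' rather than on whitespace alone.
    """
    head, _, rest = line.rpartition(")")
    comm = head.partition("(")[2]
    fields = rest.split()
    return comm, fields[0], int(fields[1])


def zombie_children(parent_pid: int, proc_root: str = "/proc") -> list[int]:
    """Return the PIDs of direct children of parent_pid that are in state Z."""
    zombies = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            _, state, ppid = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or entry was unreadable
        if ppid == parent_pid and state == "Z":
            zombies.append(int(entry.name))
    return sorted(zombies)


def should_report_unhealthy(parent_pid: int, http_ok: bool, proc_root: str = "/proc") -> bool:
    """Hypothetical policy: unhealthy if the HTTP probe failed or any child is a zombie."""
    return (not http_ok) or bool(zombie_children(parent_pid, proc_root))
```

In the situation above, `parent_pid` would be the `tritonserver` PID (657 in the `ps` output), and `http_ok` would come from whatever HTTP probe is already in place; PID 1059 would then be reported and the server marked unhealthy despite the 200.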

burling • May 16 '24 11:05

Hi @burling, thanks for filing the issue. Could you please provide the repro steps and the model files so that we can investigate further? I think the Python backend stub shouldn't remain in a zombie state; it should be cleaned up if the stub is not healthy. Also, could you try the latest Triton version and see whether the issue still happens?
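Cleaning up a dead child is the parent's job: it must reap the child with `waitpid()`, and that same call is how it learns the child died and can decide to restart it or mark itself unhealthy. A generic POSIX sketch of that parent-side step in Python, polling with `WNOHANG` so it never blocks; this is not Triton's stub-management code, and `reap_exited_children` is a hypothetical name.

```python
import os


def reap_exited_children() -> dict[int, int]:
    """Reap every child that has already exited, without blocking.

    Returns {pid: exit_code}; a negative code means the child was killed by
    that signal number. Calling this periodically keeps exited children
    from lingering as zombies.
    """
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped[pid] = os.waitstatus_to_exitcode(status)
    return reaped
```

Passing `-1` reaps any child; a supervisor that owns specific children would pass their PIDs instead, so it does not collect exit statuses that other code is waiting for.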

krishung5 • May 17 '24 18:05