
After loading a model, the Triton server suddenly stops working and shows "CUDA failed to initialize. Unknown error (error 999)".

Open chiehpower opened this issue 4 months ago • 7 comments

Description

I have a laptop with fairly new specs: an RTX 4070, 32 GB of RAM, and 24 CPU cores. When I start the Triton server, it detects the GPU and works well. However, after I moved the model out of the model repository and put it back, the server broke. It shows The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[Unknown error (error 999)]] and Unable to destroy DCGM group: setting not configured

[screenshots of the error logs attached]

Triton Information

  • Triton Server Container: v23.08
  • GPU driver version is 535.154.05

Here is my docker-compose configuration:

  tritonserver:
    image: nvcr.io/nvidia/tritonserver:23.02-py3
    deploy:
     resources:
       reservations:
         devices:
         - driver: nvidia
           capabilities: [gpu]
    command: tritonserver --model-store=/models --model-control-mode="poll" --log-info true --strict-model-config=false
    privileged: true
    volumes:
      - /var/database/data/models:/models
    shm_size: 20gb
    ulimits:
      stack: 67108864
    networks:
      webgateway:
        ipv4_address: 172.24.10.3
    container_name: trt_serving
    restart: always

To Reproduce

I cannot reproduce it reliably.

Expected behavior

Triton can detect the GPU device.

I have no idea about this problem. Your suggestions would be greatly appreciated. Thank you.

chiehpower avatar Mar 05 '24 01:03 chiehpower

@chiehpower Hi, could you check out this link? You might want to set count: all.
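For reference, a minimal sketch of what that device reservation could look like with count: all (this is an illustrative Compose fragment, assuming the NVIDIA Container Toolkit is installed on the host; it is not the exact configuration from the linked page):

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all            # reserve all GPUs; alternatively use device_ids: ['0']
          capabilities: [gpu]
```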

Also, can you try running nvidia-smi inside of the docker container without running tritonserver? See what happens.
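A quick way to check GPU visibility independently of Triton (a sketch assuming the NVIDIA Container Toolkit is installed; the CUDA image tag is only an example):

```shell
# 1. Verify the driver works on the host
nvidia-smi

# 2. Verify a plain container can see the GPU, with no Triton involved
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If step 2 already fails with error 999, the problem is in the driver/container-runtime layer rather than in Triton itself.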

yinggeh avatar Mar 05 '24 02:03 yinggeh

Sure, I will try it. Thank you. It is very weird that after I rebooted the computer, it works well now. :O I will close the issue for now; if it happens again, I will reopen it. Thank you so much, Yinggeh!

chiehpower avatar Mar 05 '24 06:03 chiehpower

I have the same issue. It gets fixed when I reboot my computer or unload and reload nvidia_uvm, but it appears again after some time. Driver version: 550.40.07, tritonserver image 24.01-py3.
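The module-reload workaround described above can be run on the host like this (a sketch; it requires root, and every process using the GPU, including the Triton container, must be stopped first or rmmod will refuse to unload the module):

```shell
# Stop containers/processes holding the GPU, then reload the UVM kernel module
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
```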

rozalinda360 avatar Mar 18 '24 05:03 rozalinda360

@rozalinda360 Thanks for your feedback. Could you provide more details on how to reproduce this issue?

yinggeh avatar Mar 19 '24 07:03 yinggeh

@yinggeh Unfortunately, I am still not certain how to reproduce the issue.

rozalinda360 avatar Mar 24 '24 07:03 rozalinda360

Hello @yinggeh, I was experimenting with multiple things and was finally able to reproduce the error. Apparently, it occurs whenever my PC goes to sleep. The error messages I got are as follows:

1. After putting the PC to sleep: client and server error screenshots attached (error_client1, error_client2, server_error_sleep)

2. After exiting the tritonserver container: server error screenshots attached (server_error_exit, server_error_exit2)

3. After trying to relaunch the server container again: server error screenshots attached (server_error_new, server_error_new2, server_error_new3)

4. After unloading and reloading nvidia_uvm, the server runs again: screenshot attached (server_running)

rozalinda360 avatar Mar 25 '24 06:03 rozalinda360

Hi @rozalinda360. Thanks for waiting. I have opened the ticket DLIS-6455 for our engineers to investigate.

yinggeh avatar Apr 10 '24 22:04 yinggeh