After loading a model, the Triton server suddenly stops working and shows `CUDA failed to initialize. Unknown error (error 999)`.
Description
I have a fairly new laptop: an RTX 4070, 32 GB of RAM, and 24 CPU cores.
When I start the Triton server, it detects the GPU device and works well. However, after I moved the model out of the model repository and put it back, the server broke. It shows:
The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[Unknown error (error 999)]]
and
Unable to destroy DCGM group: setting not configured
Triton Information
- Triton Server Container:
v23.08
- GPU driver version is 535.154.05
Here is my docker-compose file setting:

```yaml
tritonserver:
  image: nvcr.io/nvidia/tritonserver:23.02-py3
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]
  command: tritonserver --model-store=/models --model-control-mode="poll" --log-info true --strict-model-config=false
  privileged: true
  volumes:
    - /var/database/data/models:/models
  shm_size: 20gb
  ulimits:
    stack: 67108864
  networks:
    webgateway:
      ipv4_address: 172.24.10.3
  container_name: trt_serving
  restart: always
```
To Reproduce: I cannot reliably reproduce it.
Expected behavior: Triton detects the GPU device.
I have no idea about this problem. Your suggestions would be greatly appreciated. Thank you.
@chiehpower Hi, could you check out this link? You might want to set `count: all`.
Also, can you try running `nvidia-smi` inside the Docker container without running tritonserver? See what happens.
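For reference, a device reservation with an explicit count might look like this in the compose file (a sketch of the suggested change, not verified against the reporter's setup):

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all          # expose all GPUs to the container
          capabilities: [gpu]
```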
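One way to run that check, assuming Docker with the NVIDIA Container Toolkit installed (image tag taken from the compose file above), is to start a throwaway container from the same image without launching tritonserver:

```shell
# Run nvidia-smi inside the Triton image to verify the GPU is visible
# to the container runtime, independent of the tritonserver process.
docker run --rm --gpus all nvcr.io/nvidia/tritonserver:23.02-py3 nvidia-smi
```

If this fails with the same error 999, the problem is in the driver/container-runtime layer rather than in Triton itself.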
Sure. I will try it. Thank you. It is very weird that after I reboot the computer, it works well now. :O I will close the issue first, if next time it happens again, I will turn on the issue again. Thank you so much Yinggeh!
I have this same issue. It gets fixed when I reboot my computer or unload and reload nvidia_uvm, but it appears again after some time. Driver Version: 550.40.07, tritonserver image 24.01-py3.
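The nvidia_uvm reload mentioned here can be done with standard kernel-module commands on the host (a sketch, assuming root privileges and that no process is currently holding the module; stop any GPU containers first):

```shell
sudo rmmod nvidia_uvm      # unload the Unified Virtual Memory kernel module
sudo modprobe nvidia_uvm   # load it again
```

Error 999 after suspend/resume is commonly tied to nvidia_uvm losing state, which is why reloading the module (or rebooting) clears it temporarily.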
@rozalinda360 Thanks for your feedback. Could you provide more details on how to reproduce this issue?
@yinggeh Unfortunately, I am still not certain how to reproduce the issue.
Hello @yinggeh, so I was experimenting with multiple things and was finally able to reproduce the error. Apparently, it occurs whenever my PC goes to sleep. The error messages I got are as follows:
1. After making the PC go to sleep:
client:
server:
2. Exiting from tritonserver container
server:
3. After trying to relaunch the server container again
server:
4. After unloading and reloading nvidia_uvm:
server:
Hi @rozalinda360. Thanks for waiting. I have opened the ticket DLIS-6455 for our engineers to investigate.