CI Node nvml configuration issue for website and centos-gpu pipelines
Description
log output:
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
This can be caused by a mismatch between the installed driver and the currently loaded one. A reboot of the affected machine might fix it.
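One way to check for this kind of mismatch on a node is sketched below. It is a rough diagnostic, not something taken from the CI setup. It assumes a Linux host where the loaded kernel module reports its version in `/proc/driver/nvidia/version`, and where `modinfo -F version nvidia` reports the version of the module installed on disk. The function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional

# Assumption: the loaded NVIDIA kernel module exposes its version here.
PROC_VERSION = Path("/proc/driver/nvidia/version")


def parse_loaded_version(text: str) -> Optional[str]:
    """Pull the first dotted version number from the 'NVRM version:' line."""
    match = re.search(r"NVRM version:.*?\s(\d+(?:\.\d+)+)\b", text)
    return match.group(1) if match else None


def loaded_driver_version() -> Optional[str]:
    """Version of the kernel module currently loaded, or None if unavailable."""
    try:
        return parse_loaded_version(PROC_VERSION.read_text())
    except OSError:
        return None


def installed_driver_version() -> Optional[str]:
    """Version of the nvidia module installed on disk, or None if unavailable."""
    try:
        result = subprocess.run(
            ["modinfo", "-F", "version", "nvidia"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


def versions_mismatch(loaded: Optional[str], installed: Optional[str]) -> bool:
    """True only when both versions are known and they differ."""
    return loaded is not None and installed is not None and loaded != installed


if __name__ == "__main__":
    loaded = loaded_driver_version()
    installed = installed_driver_version()
    print(f"loaded={loaded} installed={installed}")
    if versions_mismatch(loaded, installed):
        print("Driver mismatch: reboot or reload the nvidia kernel modules.")
```

If either version cannot be read, the check reports no mismatch instead of guessing, so a missing driver does not show up as a false positive.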
Occurrences
Python Docs job: https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fwebsite/detail/PR-21104/7/pipeline
Also, all 3 jobs in this centos-gpu pipeline:
https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-21104/6/pipeline/143
What have you tried to solve it?
- Retrying the jobs to bypass the failure.
Thanks for filing, @DickJC123. I believe that when an instance is launched, auto-updates override some packages, causing the mismatch between the driver and nvidia-docker. I've manually updated the AMIs so that all the dependency software is up to date, so things should be back to normal. We are working on automating this effort.