CI Node nvml configuration issue for website and centos-gpu pipelines

Open DickJC123 opened this issue 2 years ago • 1 comments

Description

log output:

nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.

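This error usually means the NVIDIA kernel module that is currently loaded does not match the installed driver libraries, for example after a driver package was upgraded without reloading the module. Might be cured by a machine reboot?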
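For anyone wanting to check a node for this condition before a job starts, here is a minimal sketch, not something the CI currently runs. It assumes `nvidia-smi` is on the PATH and that the driver exposes `/proc/driver/nvidia/version`; the function names are hypothetical. It looks for the mismatch message in the tool output and, if found, reports the version of the kernel module that is actually loaded.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional

# Assumption: the loaded NVIDIA kernel module reports its version here.
PROC_VERSION = Path("/proc/driver/nvidia/version")


def loaded_module_version(text: str) -> Optional[str]:
    """Return the dotted version found on the 'NVRM version:' line, if any."""
    for line in text.splitlines():
        if line.startswith("NVRM version:"):
            # First whitespace-delimited dotted number on the line, e.g. 470.57.02
            match = re.search(r"\s(\d+\.\d+(?:\.\d+)*)\s", line)
            return match.group(1) if match else None
    return None


def reports_mismatch(tool_output: str) -> bool:
    """True if the output contains the NVML driver/library mismatch message."""
    return "driver/library version mismatch" in tool_output.lower()


def check_node() -> str:
    """Run nvidia-smi and summarize whether the node shows the mismatch."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except FileNotFoundError:
        return "nvidia-smi not found"
    except subprocess.TimeoutExpired:
        return "nvidia-smi timed out"

    output = result.stdout + result.stderr
    if reports_mismatch(output):
        loaded = None
        if PROC_VERSION.exists():
            loaded = loaded_module_version(PROC_VERSION.read_text())
        return (
            f"mismatch: loaded kernel module is {loaded or 'unknown'}; "
            "reboot or reload the module"
        )
    if result.returncode == 0:
        return "ok"
    return f"nvidia-smi failed with exit code {result.returncode}"


if __name__ == "__main__":
    print(check_node())
```

Running this as an early pipeline step would let a job fail fast with a clear message instead of surfacing the error from `nvidia-container-cli` later on.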

Occurrences

Python Docs job: https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fwebsite/detail/PR-21104/7/pipeline

Also, all 3 jobs in this centos-gpu pipeline:

https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-21104/6/pipeline/143

What have you tried to solve it?

  1. Retried the jobs to bypass the failure.

DickJC123 avatar Aug 03 '22 00:08 DickJC123

Thanks for filing, @DickJC123. I believe that when a CI node is launched, auto-updates override some packages, causing the mismatch between the driver and nvidia-docker. I've manually rebuilt the AMIs with all the dependency software updated, so it should be back to normal. We are working on automating this process.

josephevans avatar Aug 03 '22 23:08 josephevans