[CI] broken GPU testing stage
Description
CI jobs running on GPU (centos-gpu, unix-gpu and website) fail with the following error:
[2022-04-20T13:09:33.419Z] docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
...
[2022-04-20T13:09:33.419Z] docker: Error response from daemon: Unknown runtime specified nvidia.
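For reference, a minimal sketch of how the two failure modes can be checked on an affected Jenkins worker; these commands are an assumption added for illustration and are not part of the original report.

# NVML mismatch: the kernel driver and the user-space libraries must report the same version.
cat /proc/driver/nvidia/version
nvidia-smi                        # fails with "driver/library version mismatch" when they diverge

# Missing "nvidia" runtime: check that the Docker daemon has it registered.
docker info --format '{{json .Runtimes}}'
cat /etc/docker/daemon.json       # should contain a "runtimes": { "nvidia": ... } entry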
Occurrences
@DickJC123
Before I dive into this further, could you check whether the rebooting suggestions here help in this case: https://stackoverflow.com/questions/65721900/failed-to-initialize-nvml-driver-library-version-mismatch-is-ubuntu-server
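For context, the remedies in that thread come down to either rebooting the host or reloading the NVIDIA kernel modules so they match the newly installed user-space libraries; a rough sketch, assuming nothing is still holding the GPU:

# Option 1: reboot so the loaded kernel modules match the updated driver libraries.
sudo reboot

# Option 2: reload the NVIDIA kernel modules in place (fails if the GPU is in use).
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
nvidia-smi                        # should now report a consistent driver version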
Hi, I actually fixed the original issue by creating updated AMIs.
I believe the new issue is caused by the new signing keys NVIDIA has deployed for the CUDA and ML repositories, which the Docker images don't yet contain. I'm expecting NVIDIA to publish new Docker images with the updated keys soon, based on these threads (a sketch of the documented manual workaround follows the links):
https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
https://gitlab.com/nvidia/container-images/cuda/-/issues/158
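Until refreshed images are published, the NVIDIA notice above documents a manual workaround of swapping the repository signing key inside the image. A rough sketch for an Ubuntu-based CUDA image follows; the repository path is an example and needs to match the actual base image and architecture, and the CentOS images would use the corresponding RPM key instead.

# Remove the rotated (now untrusted) CUDA repository key ...
apt-key del 7fa2af80

# ... and install NVIDIA's new signing key via the cuda-keyring package.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update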