[CI] broken GPU testing stage
Description
CI jobs running on GPU (centos-gpu, unix-gpu and website) fail with the following error:
[2022-04-20T13:09:33.419Z] docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
...
[2022-04-20T13:09:33.419Z] docker: Error response from daemon: Unknown runtime specified nvidia.
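For reference, a minimal sketch of how the two failure modes can be checked on an affected Jenkins worker; these commands are an assumption added for illustration and are not part of the original report.

# NVML mismatch: the kernel driver and the user-space libraries must report the same version.
cat /proc/driver/nvidia/version
nvidia-smi                        # fails with "driver/library version mismatch" when they diverge

# Missing "nvidia" runtime: check that the Docker daemon has it registered.
docker info --format '{{json .Runtimes}}'
cat /etc/docker/daemon.json       # should contain a "runtimes": { "nvidia": ... } entry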
Occurrences
@DickJC123
Before I dive into this further, could you check whether the rebooting suggestions here help in this case: https://stackoverflow.com/questions/65721900/failed-to-initialize-nvml-driver-library-version-mismatch-is-ubuntu-server
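For context, the remedies in that thread come down to either rebooting the host or reloading the NVIDIA kernel modules so they match the newly installed user-space libraries; a rough sketch, assuming nothing is still holding the GPU:

# Option 1: reboot so the loaded kernel modules match the updated driver libraries.
sudo reboot

# Option 2: reload the NVIDIA kernel modules in place (fails if the GPU is in use).
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
nvidia-smi                        # should now report a consistent driver version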
Hi, I actually fixed the original issue by creating updated AMIs.
I believe the new issue is caused by the new signing keys NVIDIA has deployed for the CUDA and ML repositories, which the Docker images don't yet contain. I'm expecting NVIDIA to publish new Docker images with the updated keys soon, based on these threads (a sketch of the documented manual workaround follows the links):
https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
https://gitlab.com/nvidia/container-images/cuda/-/issues/158
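Until refreshed images are published, the NVIDIA notice above documents a manual workaround of swapping the repository signing key inside the image. A rough sketch for an Ubuntu-based CUDA image follows; the repository path is an example and needs to match the actual base image and architecture, and the CentOS images would use the corresponding RPM key instead.

# Remove the rotated (now untrusted) CUDA repository key ...
apt-key del 7fa2af80

# ... and install NVIDIA's new signing key via the cuda-keyring package.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update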