[CI] broken GPU testing stage

Open bgawrych opened this issue 3 years ago • 3 comments

Description

CI jobs running on GPU (centos-gpu, unix-gpu and website) fail with the following error:

[2022-04-20T13:09:33.419Z] docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
...
[2022-04-20T13:09:33.419Z] docker: Error response from daemon: Unknown runtime specified nvidia.
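The log above contains two distinct failure messages, and it can help to tell them apart when triaging many job logs. A minimal sketch in Python, assuming the job log is available as plain text; the function name `classify_gpu_ci_error` and the category labels are hypothetical and not part of the MXNet CI tooling:

```python
import re

# Message fragments copied from the log above; the labels are illustrative only.
PATTERNS = [
    ("driver_library_mismatch",
     re.compile(r"nvml error: driver/library version mismatch")),
    ("nvidia_runtime_missing",
     re.compile(r"Unknown runtime specified nvidia")),
]


def classify_gpu_ci_error(log_text):
    """Return the sorted labels of every known failure pattern found in log_text."""
    return sorted(label for label, pattern in PATTERNS if pattern.search(log_text))
```

A log that matches neither pattern yields an empty list, so unrelated failures are not misattributed to the GPU setup.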

Occurrences

PR#1 PR#2

@DickJC123

bgawrych • Apr 21 '22 07:04

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

github-actions[bot] • Apr 21 '22 07:04

Before I dive into this more, could you check whether the reboot suggestions here help in this case: https://stackoverflow.com/questions/65721900/failed-to-initialize-nvml-driver-library-version-mismatch-is-ubuntu-server

DickJC123 • May 02 '22 22:05

Hi, I actually fixed the original issue by creating updated AMIs.

I believe the new issue is caused by the new repository keys NVIDIA deployed for the CUDA and ML repos: the Docker images don't contain these keys. I'm expecting NVIDIA to publish new Docker images with the updated keys soon, based on these threads:

https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
https://gitlab.com/nvidia/container-images/cuda/-/issues/158
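One way to confirm that a build is failing because of missing repository keys is to scan the `apt-get update` output for the `NO_PUBKEY` marker that apt prints when it cannot verify a signature. A rough sketch in Python, assuming the image is apt-based and the build output has been captured as text; the function name `missing_apt_keys` is hypothetical:

```python
import re

# apt reports an unknown signing key as "NO_PUBKEY <KEYID>" (hexadecimal key ID).
NO_PUBKEY = re.compile(r"NO_PUBKEY\s+([0-9A-Fa-f]{8,40})")


def missing_apt_keys(apt_output):
    """Return the unique key IDs apt could not verify, in order of appearance."""
    seen = []
    for key_id in NO_PUBKEY.findall(apt_output):
        key_id = key_id.upper()
        if key_id not in seen:
            seen.append(key_id)
    return seen
```

An empty result means the captured output shows no missing-key errors, which would point the investigation elsewhere.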

josephevans • May 02 '22 23:05