2 3090s + nvidia-docker containers = READ-ONLY file system crash
Hello, I am trying to run two different NVIDIA-based containers on two 3090s at the same time. The code in the two containers has no need to communicate at all. The behavior is consistent across everything I've tried: the code stops running after about 15 minutes and my root file system flips into READ-ONLY mode. I can run two containers simultaneously on the same GPU. However, I cannot run two containers with one on each GPU, nor can I run one container across both with --gpus all. No matter how I run the containers, if I am using both cards, then after about 15 minutes the code stops running because the filesystem flips to READ-ONLY.
OS: Linux nus-System-Product-Name 5.4.0-58-generic #64~18.04.1-Ubuntu SMP Wed Dec 9 17:11:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
GPUs: 2x NVIDIA GeForce RTX 3090 (Zotac)
Driver Version: 455.38
CUDA Version: 11.1
SSD: Micron Technology Inc Device 51a2 (rev 01)
+++++++++++++++++++++
$ nvidia-docker version
NVIDIA Docker: 2.5.0
Client: Docker Engine - Community
 Version: 19.03.14
 API version: 1.40
 Go version: go1.13.15
 Git commit: 5eb3275d40
 Built: Tue Dec 1 19:20:17 2020
 OS/Arch: linux/amd64
 Experimental: false
CUDA details:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
My container is built FROM nvcr.io/nvidia/pytorch:20.12-py3 (I also tried earlier monthly container versions).
Commands:
$ docker run --gpus "device=0" --ipc=host -it -v $(pwd):/sonyGan -v /scratch/lonce:/mydata:ro --rm foo:bar
$ docker run --gpus "device=1" --ipc=host -it -v $(pwd):/sonyGan -v /scratch/lonce:/mydata:ro --rm foo:bar
(Although the containers don't need to communicate, I also tried:)
$ docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus "device=0" -it -v $(pwd):/sonyGan -v /scratch/lonce:/mydata:ro --rm foo:bar
$ docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus "device=1" -it -v $(pwd):/sonyGan -v /scratch/lonce:/mydata:ro --rm foo:bar
Other info that might be helpful: when the containers start, I get this message in the splash screen:
NOTE: MOFED driver for multi-node communication was not detected. Multi-node communication performance may be reduced.
When both containers are fired up and running, at first all looks good. nvidia-smi shows both GPUs to be utilized at about 30%, their memory use is about 10%, fans are at 50%, temperature at about 70 C.
Then, after about 15 minutes, boom: everything stops because the file system becomes READ-ONLY. Every time. Sometimes my display resolution is wrong when I reboot.
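(A rough sketch of how this telemetry could be logged continuously on the host, so there is a record of what the GPUs were doing right before the flip; the output path on the scratch disk and the 5-second interval are just placeholders:)

# Log GPU temperature, power, utilization, and memory every 5 s to a disk that stays writable
$ nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw,utilization.gpu,memory.used --format=csv -l 5 >> /scratch/lonce/gpu_telemetry.csv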
After enabling NVIDIA runtime debug, when I try to exit the container after the freeze, I get the following error:
ERRO[0874] Error waiting for container: container ba2591b72328f835251293f27085efeb0c16ee6abfcabd6f84b115ec3641366b: driver "overlay2" failed to remove root filesystem: unlinkat /var/lib/docker/overlay2/bd2a8e59e20f75ecb49fc99ebe9d81cf4842951cce31b98aaa6945196fe44a4a: read-only file system
Any help so that I can use both cards would, of course, be greatly appreciated.
- lonce
p.s. Additional troubleshooting: I changed my NVIDIA and system log file destinations to a different mounted disk that remains writable, but no logs were written there.
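(A minimal sketch of checks that should show whether, and why, the kernel remounted the root filesystem read-only; the root device name below is an assumption and the exact messages will vary:)

# Confirm whether / is currently mounted read-only
$ grep ' / ' /proc/mounts

# Look for I/O errors or an explicit remount-ro event in the kernel log
$ dmesg -T | grep -iE 'error|remount|read-only' | tail -n 50
$ journalctl -k -b -1 | grep -iE 'error|remount|read-only'   # previous boot, if journald is persistent

# Check the SSD's own health counters (assuming the root device is /dev/nvme0n1)
$ sudo smartctl -a /dev/nvme0n1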
This is interesting; I've never heard of something like this happening.
One thing to keep in mind is that the code run by nvidia-docker is completely passive.
I.e. it only runs at container startup to inject a set of devices, binaries, and libraries into your container, and then runs ldconfig to put those libraries into your library path. Once this is done, Docker takes over and does the rest.
In fact, you can manually achieve everything that nvidia-docker gives you with a set of -v and --device options on the docker command line.
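For illustration, a rough (and deliberately incomplete) equivalent looks something like this; the device nodes are standard, but the library paths and versions depend on your driver installation:

# Hand-rolled GPU access without the nvidia runtime: pass the device nodes and bind-mount the driver's user-space libraries
$ docker run -it --rm \
    --device /dev/nvidiactl \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia0 \
    --device /dev/nvidia1 \
    -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi:ro \
    -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.455.38:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro \
    -v /usr/lib/x86_64-linux-gnu/libcuda.so.455.38:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro \
    foo:bar nvidia-smi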
As such, it seems hard to imagine that nvidia-docker itself would contribute to your file system flipping to read-only mode.
Do you see anything in the Docker logs that could give a hint about this?
I did not. In fact, I am now seeing this problem when running a container on just one device. It is starting to look like a hardware issue. Memtest showed no problems with memory. Next, I hear gpu-burn is a good diagnostic tool. Oi. Constant checkpoint restarting is driving me crazy.
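(For reference, a quick sketch of building and running gpu-burn; this assumes the wilicc/gpu-burn GitHub repository and a CUDA toolkit on the host, and the 20-minute duration is an arbitrary choice that covers the ~15 minutes to failure:)

# Build gpu-burn and stress both cards while watching for compute errors
$ git clone https://github.com/wilicc/gpu-burn
$ cd gpu-burn && make
$ ./gpu_burn 1200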
Hi there, did you solve your problem? I'm still struggling with this while building from an NVIDIA base image on a Windows machine. If it is a hardware issue, why would it be able to build some layers from the Dockerfile but fail on others with a read-only file system error? I also asked a similar question here: https://forums.docker.com/t/docker-build-failed-to-register-layer-read-only-file-system/107336
I am also experiencing this, but with a few differences:
- I am on Ubuntu 21.04
- The timing is much less predictable. I can sometimes train for 10 hours before it happens.
- It usually happens while writing to tensorboard or saving a model backup during the callback stage, but not always.
Since I am mostly able to train for hours at a time, I probably won’t get around to troubleshooting it soon, but I am very curious what this is. If anyone learns something more, please do share.
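(In case it helps anyone gather evidence, here is a rough watcher sketch: it polls /proc/mounts and dumps the kernel log to a second, still-writable disk the moment / goes read-only. The /mnt/otherdisk path is a placeholder for whatever disk stays writable on your machine.)

#!/bin/bash
# Watch the root mount; when it flips to read-only, save the kernel log to a disk that is still writable.
LOGDIR=/mnt/otherdisk/ro-incident   # placeholder: any mount that stays writable
mkdir -p "$LOGDIR"
while true; do
    # In /proc/mounts the 4th field holds the mount options; "ro" as the first option means read-only.
    if grep -qE '^\S+ / \S+ ro[ ,]' /proc/mounts; then
        dmesg -T > "$LOGDIR/dmesg-$(date +%s).txt"
        echo "root fs went read-only at $(date)" >> "$LOGDIR/events.log"
        break
    fi
    sleep 10
done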