docker run error for image_segmentation/pytorch test when following the guide
The guide I followed is image_segmentation/pytorch.
When I try to run the container, I get the error below, which says the nvidia runtime does not exist. Could you please shed some light?
[stg@oq1 pytorch]$ sudo docker run --ipc=host -it --rm --runtime=nvidia -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/raw-data-dir:/raw_data -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/data:/data -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/results:/results unet3d:latest /bin/bash
[sudo] password for stg:
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
[stg@oq1 pytorch]$
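The message above means the Docker daemon has no runtime registered under the name `nvidia`. A minimal sketch for checking this, assuming the daemon configuration lives at the default `/etc/docker/daemon.json` and lists runtimes under a top-level `runtimes` key (the helper name `has_runtime` is hypothetical, not part of Docker or the benchmark):

```python
import json
from pathlib import Path


def has_runtime(daemon_config_text: str, name: str = "nvidia") -> bool:
    """Return True if the daemon config text registers a runtime called `name`."""
    try:
        config = json.loads(daemon_config_text or "{}")
    except json.JSONDecodeError:
        # An unparsable config cannot register any runtime.
        return False
    if not isinstance(config, dict):
        return False
    runtimes = config.get("runtimes", {})
    return isinstance(runtimes, dict) and name in runtimes


if __name__ == "__main__":
    path = Path("/etc/docker/daemon.json")  # assumed default location
    try:
        text = path.read_text()
    except OSError:
        text = "{}"  # missing or unreadable file: treat as empty config
    print("nvidia runtime registered:", has_runtime(text))
```

If this prints `False`, the NVIDIA Container Toolkit is probably not installed or not configured for Docker. On hosts where the toolkit is installed, `sudo nvidia-ctk runtime configure --runtime=docker` followed by a Docker restart is the usual way to register the runtime.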
I am using Fedora 37. I could not install CUDA container support because this script does not support Fedora:
[stg@oq1 training]$ sudo sh install_cuda_docker.sh
[sudo] password for stg:
--2023-11-23 19:31:16-- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-ubuntu2004.pin’
cuda-ubuntu2004.pin 100%[======================================================================================================================>] 190 --.-KB/s in 0s
2023-11-23 19:31:16 (10.6 MB/s) - ‘cuda-ubuntu2004.pin’ saved [190/190]
mv: cannot move 'cuda-ubuntu2004.pin' to '/etc/apt/preferences.d/cuda-repository-pin-600': No such file or directory
--2023-11-23 19:31:16-- https://developer.download.nvidia.com/compute/cuda/11.6.0/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2681112370 (2.5G) [application/x-deb]
Saving to: ‘cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb’
cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_am 100%[======================================================================================================================>] 2.50G 21.1MB/s in 2m 36s
2023-11-23 19:33:53 (16.4 MB/s) - ‘cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb’ saved [2681112370/2681112370]
sudo: dpkg: command not found
sudo: apt-key: command not found
sudo: apt-get: command not found
sudo: apt-get: command not found
sudo: apt-get: command not found
gpg: can't create '/usr/share/keyrings/docker-archive-keyring.gpg': No such file or directory
gpg: no valid OpenPGP data found.
gpg: dearmoring failed: No such file or directory
curl: (23) Failed writing body
install_cuda_docker.sh: line 15: dpkg: command not found
install_cuda_docker.sh: line 15: lsb_release: command not found
tee: /etc/apt/sources.list.d/docker.list: No such file or directory
sudo: apt: command not found
sudo: apt-get: command not found
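The log above shows the script calling `dpkg`, `apt-key`, and `apt-get`, none of which exist on Fedora. A small sketch of a guard that could run before such a script, assuming the only question is which package manager is on `PATH` (the function name `detect_package_manager` and the list of candidates are illustrative choices):

```python
import shutil


def detect_package_manager(which=shutil.which):
    """Return the first known package manager found on PATH, or None."""
    # Candidate list is an assumption; extend it for other distributions.
    for tool in ("apt-get", "dnf", "yum"):
        if which(tool):
            return tool
    return None


if __name__ == "__main__":
    manager = detect_package_manager()
    if manager != "apt-get":
        print(f"apt-get not found (detected: {manager}); "
              "an apt-based install script will fail on this host")
    else:
        print("apt-get found")
```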
I installed nvidia-docker on Fedora and can now start the container, but the next step fails with the error below. How can I fix this?
root@6ec7b9c99e06:/# ls
bin boot data dev etc home lib lib64 media mnt opt proc raw_data results root run sbin srv sys tmp usr var workspace
root@6ec7b9c99e06:/# cd workspace/unet3d/
root@6ec7b9c99e06:/workspace/unet3d# python3 preprocess_dataset.py --data_dir /raw_data --results_dir /data
Preprocessing /raw_data
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
Mean value: nan, std: nan, d: nan, h: nan, w: nan
Traceback (most recent call last):
  File "preprocess_dataset.py", line 147, in <module>
    verify_dataset(args.results_dir)
  File "preprocess_dataset.py", line 127, in verify_dataset
    assert len(source) == len(os.listdir(results_dir))
AssertionError
root@6ec7b9c99e06:/workspace/unet3d#
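The `Mean of empty slice` warnings and the `nan` statistics suggest that no input cases were processed, so `/data` ends up with fewer files than `verify_dataset` expects. A quick diagnostic sketch to run inside the container, assuming the raw data follows a layout where each case directory has `case` in its name (that naming and the helper `list_cases` are assumptions, not taken from the script):

```python
import os


def list_cases(data_dir, marker="case"):
    """Return sorted entries in data_dir whose names contain `marker`."""
    if not os.path.isdir(data_dir):
        return []
    return sorted(name for name in os.listdir(data_dir) if marker in name)


if __name__ == "__main__":
    # /raw_data is the mount point used in the docker run command above.
    cases = list_cases("/raw_data")
    print(f"found {len(cases)} case entries in /raw_data")
    for name in cases[:5]:
        print("  ", name)
```

If this reports zero entries, the host directory mounted at `/raw_data` is likely empty or nested one level deeper than expected, and the dataset download step is worth rechecking before preprocessing again.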
I reinstalled the host OS with Ubuntu 22.04 and still see this error. Could you please shed some light?
dcg@oq1:/mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch$ sudo docker run --ipc=host -it --rm --runtime=nvidia -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/raw-data-dir:/raw_data -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/data:/data -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/results:/results unet3d:latest /bin/bash
root@7f2d8fc3d617:/workspace/unet3d# ls
Dockerfile LICENCE README.md checksum.json data_loading evaluation_cases.txt main.py model oldREADME.md preprocess_dataset.py requirements.txt run_and_time.sh runtime
root@7f2d8fc3d617:/workspace/unet3d# python3 preprocess_dataset.py --data_dir /raw_data --results_dir /data
Preprocessing /raw_data
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
Mean value: nan, std: nan, d: nan, h: nan, w: nan
Traceback (most recent call last):
  File "preprocess_dataset.py", line 147, in <module>
    verify_dataset(args.results_dir)
  File "preprocess_dataset.py", line 127, in verify_dataset
    assert len(source) == len(os.listdir(results_dir))
AssertionError
root@7f2d8fc3d617:/workspace/unet3d#
Sorry, but the unet3d benchmark has been dropped from the training benchmark suite, so this issue cannot be addressed at this time.