DeepLearningExamples
[nnUNet/PyTorch] PyTorch Library Import Error with most recent release
Related to nnUNet/PyTorch
Describe the bug
Inside the Docker container, running python main.py --help produces the following traceback:
Traceback (most recent call last):
File "main.py", line 19, in <module>
from pytorch_lightning import Trainer, seed_everything
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
from pytorch_lightning import metrics # noqa: E402
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
from pytorch_lightning.metrics.classification import ( # noqa: F401
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
from pytorch_lightning.metrics.classification.accuracy import Accuracy # noqa: F401
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
from pytorch_lightning.metrics.utils import deprecated_metrics
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 22, in <module>
from torchmetrics.utilities.data import get_num_classes as _get_num_classes
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/data.py)
To Reproduce
Steps to reproduce the behavior:
- Create Docker image by following quick start guide on nnUNet for PyTorch
- "Shell" into container with
sudo docker run -it nnunet:latest /bin/bash
- Execute main.py
python main.py --help
Downgrading torchmetrics to v0.6.0 seems to resolve the issue.
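To confirm which torchmetrics version the container actually has, a small check along these lines may help. It is only a sketch: the helper names (parse_version, is_newer_than, installed_version) are hypothetical, the parser handles plain dotted versions only, and 0.6.0 is used as the reference solely because it is the version reported to work in this thread.

```python
from importlib import metadata  # standard library in Python 3.8+


def parse_version(text):
    """Turn a dotted version such as '0.6.0' into a tuple of ints.

    Only the leading purely numeric components are kept,
    so '0.7.0rc1' becomes (0, 7).
    """
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_newer_than(installed, reference="0.6.0"):
    """True when `installed` is newer than the reference version."""
    return parse_version(installed) > parse_version(reference)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("torchmetrics")
    if found is None:
        print("torchmetrics is not installed")
    elif is_newer_than(found):
        # Assumption: versions newer than 0.6.0 may trigger the ImportError above.
        print(f"torchmetrics {found} is newer than 0.6.0")
    else:
        print(f"torchmetrics {found} is at or below 0.6.0")
```

If the installed version turns out to be newer, pip install torchmetrics==0.6.0 pins the version reported to work above.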
Unfortunately, after changing the torchmetrics version I now run into a different traceback:
File "main.py", line 34, in <module>
set_affinity(int(os.getenv("LOCAL_RANK", "0")), args.gpus, mode=args.affinity)
File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 376, in set_affinity
set_socket_unique_affinity(gpu_id, nproc_per_node, cores, "contiguous", balanced)
File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 263, in set_socket_unique_affinity
os.sched_setaffinity(0, ungrouped_affinities[gpu_id])
OSError: [Errno 22] Invalid argument
This error seems to persist no matter what value I pass to the --affinity flag.
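For context, os.sched_setaffinity typically fails with Errno 22 when none of the requested CPU ids are available to the process, which can happen when a container is restricted to fewer cores than the host reports. Whether that is the cause here is an assumption. The sketch below only compares a requested core set with the cores the process may use; usable_cores and try_set_affinity are hypothetical helpers, not part of nnUNet, and the scheduling calls are Linux-only.

```python
import os


def usable_cores(requested, allowed):
    """Return the requested core ids that the process is allowed to run on."""
    return set(requested) & set(allowed)


def try_set_affinity(requested):
    """Pin the current process to the usable part of `requested`.

    Returns the set that was applied, or None when nothing was applied
    (no overlap, or the platform lacks the Linux-only scheduling calls).
    """
    if not hasattr(os, "sched_getaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # cores this process may use
    cores = usable_cores(requested, allowed)
    if not cores:
        # An empty or fully disjoint mask would be rejected with Errno 22.
        return None
    os.sched_setaffinity(0, cores)
    return cores


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        print("cores available to this process:", sorted(os.sched_getaffinity(0)))
```

Printing the available cores inside the container and comparing them with the set passed to os.sched_setaffinity in gpu_affinity.py should show whether the requested cores fall outside what the container allows.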
Have you tried running with --affinity disabled, or commenting out lines 32-33 in main.py (https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Segmentation/nnUNet/main.py#L32)?
Another fix for the torchmetrics error is to upgrade PyTorch Lightning to 1.5.10 (there are issues with 1.6.0 at the moment).