[nnUNet/PyTorch] PyTorch Library Import Error with most recent release

Open tjhendrickson opened this issue 2 years ago • 4 comments

Related to nnUNet/PyTorch

Describe the bug

Inside the Docker container, running python main.py --help fails with the following traceback:

Traceback (most recent call last):
  File "main.py", line 19, in <module>
    from pytorch_lightning import Trainer, seed_everything
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 22, in <module>
    from torchmetrics.utilities.data import get_num_classes as _get_num_classes
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/data.py)

To Reproduce

Steps to reproduce the behavior:

  1. Build the Docker image by following the quick start guide for nnUNet for PyTorch
  2. Open a shell in the container with sudo docker run -it nnunet:latest /bin/bash
  3. Run python main.py --help

tjhendrickson avatar Apr 18 '22 21:04 tjhendrickson

Downgrading torchmetrics to v0.6.0 seems to resolve the issue.
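
A quick way to confirm which versions are installed, and whether the name from the traceback can be imported, is sketched below. This is an illustrative check and not part of the original report: the helper names installed_version and can_import_name are hypothetical, and it relies only on the standard library (importlib.metadata, available from Python 3.8).

```python
import importlib
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def can_import_name(module_name, name):
    """Return True if the module imports cleanly and exposes the given attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, name)


if __name__ == "__main__":
    # Package names taken from the traceback above.
    for package in ("torchmetrics", "pytorch-lightning"):
        print(package, installed_version(package))
    # The import that fails in pytorch_lightning/metrics/utils.py.
    print(
        "get_num_classes importable:",
        can_import_name("torchmetrics.utilities.data", "get_num_classes"),
    )
```

If the last line prints False, the installed torchmetrics does not provide the name that the installed pytorch_lightning expects, which matches the ImportError above.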

tjhendrickson avatar Apr 19 '22 18:04 tjhendrickson

Unfortunately, after changing the torchmetrics version, I am now running into a different error:

  File "main.py", line 34, in <module>
    set_affinity(int(os.getenv("LOCAL_RANK", "0")), args.gpus, mode=args.affinity)
  File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 376, in set_affinity
    set_socket_unique_affinity(gpu_id, nproc_per_node, cores, "contiguous", balanced)
  File "/workspace/nnunet_pyt/utils/gpu_affinity.py", line 263, in set_socket_unique_affinity
    os.sched_setaffinity(0, ungrouped_affinities[gpu_id])
OSError: [Errno 22] Invalid argument

The error persists no matter what value I pass to the --affinity flag.
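
One possible cause, not confirmed in this thread: os.sched_setaffinity raises EINVAL (Errno 22) when the requested CPU set contains no core that the process is permitted to run on, which can happen when a container is restricted to fewer cores than the host reports. A minimal sketch for checking this, assuming Linux; the helper names usable_cores and set_affinity_if_valid are hypothetical and are not taken from gpu_affinity.py:

```python
import os


def usable_cores(requested, allowed):
    """Return the requested cores that the process is actually permitted to use."""
    return set(requested) & set(allowed)


def set_affinity_if_valid(requested):
    """Pin the current process to the permitted subset of `requested` cores.

    Returns the affinity set in effect afterwards. If none of the requested
    cores is permitted, the affinity is left unchanged instead of raising.
    """
    allowed = os.sched_getaffinity(0)  # cores this process may run on (Linux only)
    cores = usable_cores(requested, allowed)
    if cores:
        os.sched_setaffinity(0, cores)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    # os.cpu_count() reports the cores on the machine, not those granted to the process.
    print("cores on the machine:", os.cpu_count())
    print("cores allowed for this process:", sorted(allowed))
```

If the two printed counts differ, an affinity mask computed from the full machine topology may include only cores outside the allowed set, which would produce the OSError shown above.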

tjhendrickson avatar Apr 19 '22 22:04 tjhendrickson

Have you tried running with --affinity disabled, or commenting out lines 32-33 in main.py? (https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Segmentation/nnUNet/main.py#L32)

Another fix for the torchmetrics error is to upgrade PyTorch Lightning to 1.5.10 (1.6.0 has issues at the moment).

michal2409 avatar Apr 20 '22 09:04 michal2409
