
nnUNetV2Runner cannot be run with NVIDIA MIG configuration

Open che85 opened this issue 1 year ago • 0 comments

Describe the bug

python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 --gpu_id {MIG_UUID}

When I pass the UUID of the MIG device as gpu_id, I get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.10/dist-packages/nnunetv2/run/run_training.py", line 113, in run_ddp
    torch.cuda.set_device(torch.device('cuda', dist.get_rank()))
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Similarly, setting CUDA_VISIBLE_DEVICES (CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet nnUNetV2Runner train_single_model) does not work, because nnUNetV2Runner overwrites the value.

Running nnUNet natively works fine with:

CUDA_VISIBLE_DEVICES={MIG_UUID} nnUNetv2_train ... 2d 4

To Reproduce

Steps to reproduce the behavior:

  1. Use a machine with a MIG-enabled GPU.
  2. Run:
python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 --gpu_id {MIG_UUID}

OR

CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet nnUNetV2Runner train_single_model --input_config "./input.yaml" --config "2d" --fold 0 

Expected behavior

CUDA_VISIBLE_DEVICES should not be overwritten if it was provided.
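
To illustrate the behavior I would expect, here is a minimal sketch. It is not MONAI's actual implementation: `resolve_visible_devices` is a hypothetical helper, and it assumes that `gpu_id=None` means "no GPU was requested explicitly". An explicit gpu_id (an index, a list of indices, or a MIG UUID string) takes precedence; otherwise a CUDA_VISIBLE_DEVICES value that is already set in the environment is left untouched.

```python
import os


def resolve_visible_devices(gpu_id=None, env=None):
    """Return the string to use for CUDA_VISIBLE_DEVICES, or None if unset.

    Hypothetical helper for illustration only (not part of MONAI).
    Assumes gpu_id=None means the caller did not request a device.
    """
    env = os.environ if env is None else env

    if gpu_id is not None:
        # An explicit request wins. Lists/tuples are joined with commas;
        # a MIG UUID string is passed through unchanged.
        if isinstance(gpu_id, (list, tuple)):
            return ",".join(str(g) for g in gpu_id)
        return str(gpu_id)

    # Nothing requested explicitly: keep whatever the user already exported.
    return env.get("CUDA_VISIBLE_DEVICES")
```

With this logic, `CUDA_VISIBLE_DEVICES={MIG_UUID} python -m monai.apps.nnunet ...` would keep the MIG UUID, and `--gpu_id {MIG_UUID}` would pass the UUID string through as-is.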

che85, Feb 26 '24 15:02