
Issue with distributed `SyncBatchNorm` in MIL pipeline

Open bhashemian opened this issue 3 years ago • 11 comments

A user has reported an issue with the MIL pipeline when it is used with the distributed flag: #5081

I have tested it with MONAI tag 0.9.1 and it works fine, while it fails with the latest version of MONAI. This needs to be investigated.

Log reported by the user:

Versions:
NVIDIA Release 22.08 (build 42105213)
PyTorch Version 1.13.0a0+d321be6
projectmonai/monai:latest DIGEST: sha256:109d2204811a4a0f9f6bf436eca624c42ed9bb3dbc6552c90b65a2db3130fefd

Error:

Traceback (most recent call last):
  File "MIL.py", line 724, in <module>
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args,))
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/MIL.py", line 565, in main_worker
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, scaler=scaler, epoch=epoch, args=args)
  File "/workspace/MIL.py", line 61, in train_epoch
    logits = model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1009, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 970, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/monai/monai/networks/nets/milmodel.py", line 238, in forward
    x = self.net(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 270, in _forward_impl
    x = self.relu(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 102, in forward
    return F.relu(input, inplace=self.inplace)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1453, in relu
    return handle_torch_function(relu, (input,), input, inplace=inplace)
  File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1528, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/opt/monai/monai/data/meta_tensor.py", line 249, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 1089, in __torch_function__
    ret = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1455, in relu
    result = torch.relu_(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
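
From the traceback, the failure appears to distill to the output of `SyncBatchNorm` (a view produced by a custom autograd Function) being modified in place by `ReLU(inplace=True)` while the input is a MetaTensor. Below is a hypothetical minimal sketch of that pattern, not the user's script; it assumes at least two CUDA devices, since SyncBatchNorm only synchronizes when a process group with world_size > 1 is initialized:

```python
# Hypothetical minimal sketch (not the user's script): DDP + SyncBatchNorm +
# in-place ReLU fed with a MONAI MetaTensor. Assumes >= 2 CUDA devices.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

from monai.data import MetaTensor


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # BatchNorm followed by an in-place ReLU, mirroring torchvision's ResNet stem.
    net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(inplace=True)).cuda(rank)
    net = nn.SyncBatchNorm.convert_sync_batchnorm(net)
    net = nn.parallel.DistributedDataParallel(net, device_ids=[rank])

    # A MetaTensor input exercises MONAI's __torch_function__ override; a plain
    # torch.Tensor here is expected not to trigger the error.
    x = MetaTensor(torch.rand(2, 3, 32, 32)).cuda(rank)
    y = net(x)  # expected to raise the "SyncBatchNormBackward is a view ..." RuntimeError
    y.sum().backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Per the reports above, the same pattern works with MONAI 0.9.1, so pinning to 0.9.1 is a temporary option while the root cause is investigated.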

bhashemian avatar Sep 22 '22 19:09 bhashemian

@myron do you have any insight here?

bhashemian avatar Sep 23 '22 16:09 bhashemian

I'm not sure what it means

myron avatar Sep 27 '22 18:09 myron

I think it's related to the new MetaTensor somehow; please see my issue here: https://github.com/Project-MONAI/MONAI/issues/5283

myron avatar Oct 07 '22 03:10 myron

Hello! I also encountered a very similar error, though I was running different code (not sure if it's related).

The error occurs when I run DDP with a torch dataset I made using MONAI's ImageDataset. The funny thing is that it works when I run with only one GPU allocated, but fails when I try to use multiple GPUs.

The MONAI version I was running was 1.0.0, with torch version 1.11.0+cu113.

The error I got was the following:

Get data_path from the config as well (log line translated from Korean; printed once per process)
<class 'monai.transforms.utility.array.AddChannel'>: Class `AddChannel` has been deprecated since version 0.8. please use MetaTensor data type and monai.transforms.EnsureChannelFirst instead.
<class 'monai.transforms.utility.array.AddChannel'>: Class `AddChannel` has been deprecated since version 0.8. please use MetaTensor data type and monai.transforms.EnsureChannelFirst instead.
Traceback (most recent call last):
  File "main_3D.py", line 348, in <module>
    main()
  File "main_3D.py", line 75, in main
    torch.multiprocessing.spawn(main_worker, (args,), args.ngpus_per_node)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/main_3D.py", line 141, in main_worker
    loss = model.forward(y1, y2)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/main_3D.py", line 222, in forward
    z1 = self.projector(self.backbone(y1))           #i.e. z1 : representation of y1 (before normalization)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/functional.py", line 1438, in relu
    return handle_torch_function(relu, (input,), input, inplace=inplace)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/overrides.py", line 1394, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/monai/data/meta_tensor.py", line 249, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/_tensor.py", line 1142, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/functional.py", line 1440, in relu
    result = torch.relu_(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.

I could share the code if anyone thinks it might help with solving this issue! (However, I should say that since I am a novice at PyTorch, the error I got might be a PyTorch problem and not a MONAI one!)

dyhan316 avatar Oct 07 '22 03:10 dyhan316

Update: DDP works when using MONAI 0.9.1. Therefore, I think it's the same issue as the OP's.

dyhan316 avatar Oct 07 '22 05:10 dyhan316

It seems that this is a PyTorch issue caused by using MetaTensor (a subclass of torch.Tensor). @wyli has created a bug report on PyTorch: https://github.com/pytorch/pytorch/issues/86456
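
For context on where the subclass comes from (assuming I read the 1.0 changes correctly): MONAI's array transforms now track metadata by default, so pipelines built with them hand MetaTensor inputs to the model, and that is the tensor subclass that hits the SyncBatchNorm view check. A small illustrative sketch:

```python
import torch
from monai.data import MetaTensor
from monai.transforms import ScaleIntensity

# In MONAI >= 1.0, array transforms return MetaTensor (a torch.Tensor subclass)
# by default, so inputs produced by MONAI pipelines are tensor subclasses.
img = ScaleIntensity()(torch.rand(1, 32, 32, 32))
print(type(img))                      # <class 'monai.data.meta_tensor.MetaTensor'>
print(isinstance(img, torch.Tensor))  # True: subclass of torch.Tensor
print(isinstance(img, MetaTensor))    # True
```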

bhashemian avatar Oct 07 '22 14:10 bhashemian

Thank you! Also, thank you for this wonderful package! :)

dyhan316 avatar Oct 07 '22 14:10 dyhan316

Hi, any plans on fixing this?

ibro45 avatar Jan 13 '23 22:01 ibro45

This requires an upstream fix, which is being discussed here: https://github.com/pytorch/pytorch/issues/86456

A workaround would be dropping the metadata of the MetaTensor x using x.as_tensor().
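
A rough sketch of that workaround (the `model` and `data` names below are placeholders, not from any of the scripts above):

```python
import torch
from monai.data import MetaTensor


def strip_meta(x: torch.Tensor) -> torch.Tensor:
    """Return a plain torch.Tensor so SyncBatchNorm never sees a MetaTensor."""
    return x.as_tensor() if isinstance(x, MetaTensor) else x


# e.g. in the training loop, before the DDP/SyncBatchNorm forward pass:
# logits = model(strip_meta(data))
```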

wyli avatar Jan 14 '23 00:01 wyli

Because the upstream bug has not yet been fixed, this ticket should be kept open. Same as https://github.com/Project-MONAI/MONAI/issues/5283.

KumoLiu avatar Dec 20 '23 08:12 KumoLiu