[BUG] AttributeError: 'SparseTensor' object has no attribute 'is_cuda' for SYNCBATCHNORM

Open sandeepnmenon opened this issue 3 years ago • 6 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

I am running the SPVCNN model built with torchsparse, using torchpack to run the model wrapped in DistributedDataParallel. This works fine. However, when I experiment with SyncBatchNorm as follows:

model = builder.make_model()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(),
    device_ids=[dist.local_rank()],
    find_unused_parameters=True)

I get the following error

File "train.py", line 124, in <module>
   main()
 File "train.py", line 105, in main
   trainer.train_with_defaults(
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torchpack/train/trainer.py", line 37, in train_with_defaults
   self.train(dataflow=dataflow,
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torchpack/train/trainer.py", line 79, in train
   output_dict = self.run_step(feed_dict)
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torchpack/train/trainer.py", line 125, in run_step
   output_dict = self._run_step(feed_dict)
 File "/home/menonsandu/point-cloud-segmentation/private-e3d/e3d_pcd/spvnas/core/trainers.py", line 37, in _run_step
   outputs = self.model(inputs)
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
   result = self.forward(*input, **kwargs)
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
   output = self.module(*inputs[0], **kwargs[0])
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
   result = self.forward(*input, **kwargs)
 File "/home/menonsandu/point-cloud-segmentation/private-e3d/e3d_pcd/spvnas/core/models/semantic_kitti/spvcnn.py", line 197, in forward
   x0 = self.stem(x0)
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
   result = self.forward(*input, **kwargs)
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
   input = module(input)
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
   result = self.forward(*input, **kwargs)
 File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 486, in forward
   if not input.is_cuda:
AttributeError: 'SparseTensor' object has no attribute 'is_cuda'

Expected Behavior

torchsparse models should work with torch.nn.SyncBatchNorm.convert_sync_batchnorm.

Environment

- GCC: 8.4.0
- NVCC: 10.2.89
- PyTorch: 1.8.1+cu102
- PyTorch CUDA: 10.2
- TorchSparse: 1.2.0

Anything else?

No response

sandeepnmenon, Aug 18 '21 13:08

  • TorchSparse: 1.2.0

Is it reproducible in the latest version of torchsparse?

digital-idiot, Aug 18 '21 16:08

@digital-idiot Yes, I am able to reproduce it in version 1.4.0.

sandeepnmenon, Aug 19 '21 08:08

@sandeepnmenon Hi!

If you look at the definition of batch norm, you can see that we use the fapply function to run the forward with a SparseTensor. https://github.com/mit-han-lab/torchsparse/blob/74099d10a51c71c14318bce63d6421f698b24f24/torchsparse/nn/modules/norm.py#L13

I would suggest doing the same thing for SyncBatchNorm: remove the convert_sync_batchnorm call and manually replace the batch norm layers in the model definition with your new SyncBatchNorm module.

CCInc, Aug 19 '21 13:08

Thanks @CCInc! We may actually define a similar convert_sync_batchnorm function in TorchSparse to convert BatchNorm into its synchronized version for both normal and sparse tensors.
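
For illustration, such a converter might look roughly like the sketch below. It is not the TorchSparse implementation: the names SparseBatchNorm, SparseSyncBatchNorm and convert_sparse_sync_batchnorm are hypothetical, and SparseTensor and fapply are minimal stand-ins so the snippet runs without TorchSparse installed. The recursion follows the same idea as torch.nn.SyncBatchNorm.convert_sync_batchnorm, but swaps sparse batch norm layers for a sparse-aware synchronized subclass instead of a plain nn.SyncBatchNorm.

```python
from dataclasses import dataclass

import torch
from torch import nn
from torch.nn.modules.batchnorm import _BatchNorm


# Stand-ins (assumptions) so the sketch runs without TorchSparse. In a real
# model, use TorchSparse's own SparseTensor, fapply and BatchNorm instead.
@dataclass
class SparseTensor:
    feats: torch.Tensor
    coords: torch.Tensor


def fapply(input, fn):
    # Apply fn to the dense feature matrix only and rewrap the result.
    return SparseTensor(feats=fn(input.feats), coords=input.coords)


class SparseBatchNorm(nn.BatchNorm1d):
    def forward(self, input):
        return fapply(input, super().forward)


class SparseSyncBatchNorm(nn.SyncBatchNorm):
    # Hypothetical sparse-aware SyncBatchNorm: normalizes input.feats.
    def forward(self, input):
        return fapply(input, super().forward)


def convert_sparse_sync_batchnorm(module, process_group=None):
    """Recursively replace batch norm layers with synchronized versions."""
    if isinstance(module, SparseBatchNorm):
        sync_cls = SparseSyncBatchNorm
    elif isinstance(module, _BatchNorm) and not isinstance(module, nn.SyncBatchNorm):
        sync_cls = nn.SyncBatchNorm  # dense layers get the stock SyncBatchNorm
    else:
        sync_cls = None  # not a batch norm, or already synchronized

    output = module
    if sync_cls is not None:
        output = sync_cls(module.num_features, module.eps, module.momentum,
                          module.affine, module.track_running_stats, process_group)
        if module.affine:
            # Reuse the existing parameters rather than re-initializing them.
            output.weight = module.weight
            output.bias = module.bias
        output.running_mean = module.running_mean
        output.running_var = module.running_var
        output.num_batches_tracked = module.num_batches_tracked
        output.training = module.training

    for name, child in module.named_children():
        output.add_module(name, convert_sparse_sync_batchnorm(child, process_group))
    return output


# Toy module tree; only the structure matters here.
model = nn.Sequential(SparseBatchNorm(8), nn.Sequential(nn.BatchNorm1d(8)))
model = convert_sparse_sync_batchnorm(model)
```

The sparse check has to come first, because a sparse batch norm that subclasses nn.BatchNorm1d would otherwise be caught by the generic branch and lose its SparseTensor handling. As with any nn.SyncBatchNorm, the converted layers are meant to run on GPU inside an initialized distributed process group.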

zhijian-liu, Aug 22 '21 02:08

@zhijian-liu Great! I will assign it to you for now and if I have time I'll take a look.

CCInc, Sep 24 '21 16:09

Is there any progress on this issue? Because of this problem, torchsparse cannot be used for multi-GPU training with PyTorch Lightning.

SFMDI, Aug 11 '22 02:08

Is there any progress on this issue? Because of this problem, torchsparse cannot be used for multi-GPU training with PyTorch Lightning.

@zhijian-liu @CCInc Because of this problem, torchsparse also cannot be used for single-GPU training with PyTorch Lightning, even with no change to the model.

LeopoldACC, Nov 24 '22 13:11

Thanks @CCInc! We may actually define a similar convert_sync_batchnorm function in TorchSparse to convert BatchNorm into its synchronized version for both normal and sparse tensors.

I would love to contribute to this feature. Will make a PR in the next 2 days.

sandeepnmenon, Dec 23 '22 22:12

@sandeepnmenon Thank you!

zhijian-liu, Dec 24 '22 22:12

Thanks @CCInc! We may actually define a similar convert_sync_batchnorm function in TorchSparse to convert BatchNorm into its synchronized version for both normal and sparse tensors.

I would love to contribute to this feature. Will make a PR in the next 2 days.

Any updates?

fengziyue, Jan 09 '23 05:01

I am having trouble installing torchsparse after updating to Ubuntu 22.04. I get a compiler version mismatch error, as explained in issue #189.

I am still trying to work around this. I get the same compiler version mismatch error when using Docker images.

sandeepnmenon, Jan 09 '23 18:01

I am having trouble installing torchsparse after updating to Ubuntu 22.04. I get a compiler version mismatch error, as explained in issue #189.

I am still trying to work around this. I get the same compiler version mismatch error when using Docker images.

Appreciate your prompt reply! Is there any way to work around this problem and enable DDP training with PyTorch Lightning?

fengziyue, Jan 09 '23 19:01

Hi there, I'm having a similar problem. Is there an easy workaround in the meantime?

EDIT: I added the following:

class SparseSyncBatchNorm(nn.SyncBatchNorm):
    def forward(self, input: SparseTensor) -> SparseTensor:
        return fapply(input, super().forward)

and then called it in my model with SparseSyncBatchNorm(nf, momentum=momentum). However, this is incredibly slow. I wonder if there is a way to speed up computation internally with SparseTensors?

gelnesr, Jan 11 '23 00:01

Since TorchSparse has been upgraded to v2.1.0, could you please attempt to install the latest version? I will now close this issue, but please don't hesitate to reopen it if the problem persists.

zhijian-liu, Jul 15 '23 01:07