torchsparse
[BUG] AttributeError: 'SparseTensor' object has no attribute 'is_cuda' for SYNCBATCHNORM
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
I am running the SPVCNN model built with torchsparse, using torchpack to run the model wrapped in DistributedDataParallel. This works fine. However, when experimenting with SyncBatchNorm:
model = builder.make_model()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(),
    device_ids=[dist.local_rank()],
    find_unused_parameters=True)
I get the following error
File "train.py", line 124, in <module>
main()
File "train.py", line 105, in main
trainer.train_with_defaults(
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torchpack/train/trainer.py", line 37, in train_with_defaults
self.train(dataflow=dataflow,
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torchpack/train/trainer.py", line 79, in train
output_dict = self.run_step(feed_dict)
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torchpack/train/trainer.py", line 125, in run_step
output_dict = self._run_step(feed_dict)
File "/home/menonsandu/point-cloud-segmentation/private-e3d/e3d_pcd/spvnas/core/trainers.py", line 37, in _run_step
outputs = self.model(inputs)
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/menonsandu/point-cloud-segmentation/private-e3d/e3d_pcd/spvnas/core/models/semantic_kitti/spvcnn.py", line 197, in forward
x0 = self.stem(x0)
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/menonsandu/point-cloud-segmentation/venv-spvnas/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 486, in forward
if not input.is_cuda:
AttributeError: 'SparseTensor' object has no attribute 'is_cuda'
Expected Behavior
torchsparse models should work with torch.nn.SyncBatchNorm.convert_sync_batchnorm.
Environment
- GCC: 8.4.0
- NVCC: 10.2.89
- PyTorch: 1.8.1+cu102
- PyTorch CUDA: 10.2
- TorchSparse: 1.2.0
Anything else?
No response
Is it reproducible in the latest version of torchsparse?
@digital-idiot Yes I am able to reproduce it in the 1.4.0 version.
@sandeepnmenon Hi! If you look at the definition of batch norm, you can see that we use the fapply function to run the forward pass with a SparseTensor: https://github.com/mit-han-lab/torchsparse/blob/74099d10a51c71c14318bce63d6421f698b24f24/torchsparse/nn/modules/norm.py#L13
I would suggest doing the same thing for SyncBatchNorm: drop the convert_sync_batchnorm call and manually swap your new SyncBatchNorm module into the model definition.
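For reference, here is a minimal sketch of what such a module could look like, following the BatchNorm pattern in the linked norm.py and assuming fapply can be imported from torchsparse.nn.utils (adjust the import to match your TorchSparse version):

from torch import nn
from torchsparse import SparseTensor
from torchsparse.nn.utils import fapply

class SyncBatchNorm(nn.SyncBatchNorm):
    def forward(self, input: SparseTensor) -> SparseTensor:
        # Apply the dense SyncBatchNorm to the feature matrix only; fapply
        # returns a new SparseTensor with the coordinates and stride unchanged.
        return fapply(input, super().forward)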
Thanks @CCInc! We may actually define a similar convert_sync_batchnorm function in TorchSparse to convert BatchNorm into its synchronized version for both normal and sparse tensors.
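As a rough, hypothetical sketch (this helper does not exist in TorchSparse at the time of writing), it could mirror the logic of torch.nn.SyncBatchNorm.convert_sync_batchnorm and reuse a fapply-based SyncBatchNorm like the one sketched above:

import torch
from torch import nn
import torchsparse.nn as spnn

def convert_sync_batchnorm(module, process_group=None):
    # Hypothetical helper: walk the module tree and replace every BatchNorm
    # layer with a synchronized version, treating sparse BatchNorm specially.
    converted = module
    if isinstance(module, spnn.BatchNorm):
        # spnn.BatchNorm subclasses nn.BatchNorm1d, so it must be checked first;
        # SyncBatchNorm here is the fapply-based module sketched above.
        converted = SyncBatchNorm(module.num_features, module.eps, module.momentum,
                                  module.affine, module.track_running_stats,
                                  process_group)
        if module.affine:
            with torch.no_grad():
                converted.weight = module.weight
                converted.bias = module.bias
        converted.running_mean = module.running_mean
        converted.running_var = module.running_var
        converted.num_batches_tracked = module.num_batches_tracked
    elif isinstance(module, nn.modules.batchnorm._BatchNorm):
        # Dense BatchNorm layers can go through PyTorch's own converter.
        converted = nn.SyncBatchNorm.convert_sync_batchnorm(module, process_group)
    for name, child in module.named_children():
        converted.add_module(name, convert_sync_batchnorm(child, process_group))
    return converted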
@zhijian-liu Great! I will assign it to you for now and if I have time I'll take a look.
Is there any progress on this issue? torchsparse cannot be used for multi-GPU training with PyTorch Lightning because of this problem.
@zhijian-liu @CCInc torchsparse cannot be used even for single-GPU training with PyTorch Lightning, with no change to the model.
I would love to contribute to this feature. Will make a PR in the next 2 days.
@sandeepnmenon Thank you!
Any updates?
I am having some trouble with the installation after my Ubuntu 22.04 update: I get an error that the compiler versions mismatch, as explained in issue #189. I am still trying to get around this; when using Docker images as well, I get a compiler version mismatch error.
Appreciate your prompt reply! Is there any way to work around this problem and enable DDP training with PyTorch Lightning?
Hi there, I'm having a similar problem. Is there an easy workaround in the meantime?
EDIT: I added the following:
from torch import nn
from torchsparse import SparseTensor
from torchsparse.nn.utils import fapply

class SparseSyncBatchNorm(nn.SyncBatchNorm):
    def forward(self, input: SparseTensor) -> SparseTensor:
        return fapply(input, super().forward)
and then called it in my model with SparseSyncBatchNorm(nf, momentum=momentum). However, this is incredibly slow. I wonder if there is a way to speed up computation internally with SparseTensors?
Since TorchSparse has been upgraded to v2.1.0, could you please attempt to install the latest version? I will now close this issue, but please don't hesitate to reopen it if the problem persists.