torchsparse
RuntimeError: CUDA error: an illegal memory access was encountered
Hi, I am trying to train SPVCNN on a custom dataset. Unfortunately I keep running into this error during training:
File "/home/username/workspace/projects/spvnas/train.py", line 117, in <module>
main()
File "/home/username/workspace/projects/spvnas/train.py", line 93, in main
trainer.train_with_defaults(
File "/opt/conda/lib/python3.10/site-packages/torchpack/train/trainer.py", line 37, in train_with_defaults
self.train(dataflow=dataflow,
File "/opt/conda/lib/python3.10/site-packages/torchpack/train/trainer.py", line 79, in train
output_dict = self.run_step(feed_dict)
File "/opt/conda/lib/python3.10/site-packages/torchpack/train/trainer.py", line 125, in run_step
output_dict = self._run_step(feed_dict)
File "/home/username/workspace/projects/spvnas/core/trainers.py", line 52, in _run_step
outputs = self.model(inputs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/username/workspace/projects/spvnas/core/models/semantic_kitti/spvcnn.py", line 191, in forward
x1 = self.stage1(x1)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/username/workspace/projects/spvnas/core/models/semantic_kitti/spvcnn.py", line 21, in forward
out = self.net(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "torchsparse/nn/modules/conv.pyx", line 99, in torchsparse.nn.modules.conv.Conv3d.forward
File "torchsparse/nn/functional/conv/conv.pyx", line 89, in torchsparse.nn.functional.conv.conv.conv3d
File "torchsparse/nn/functional/conv/kmap/build_kmap.pyx", line 83, in torchsparse.nn.functional.conv.kmap.build_kmap.build_kernel_map
File "torchsparse/nn/functional/conv/kmap/func/hashmap_on_the_fly.pyx", line 84, in torchsparse.nn.functional.conv.kmap.func.hashmap_on_the_fly.build_kmap_implicit_GEMM_hashmap_on_the_fly
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- The error occurs sporadically. Sometimes the training script runs for multiple epochs and then crashes, other times it fails after a couple of minutes.
- I have tried to check whether there is any error in the input data (empty point clouds, or nan/inf values), but that does not seem to be the case (a rough sketch of these checks is shown after this list).
- The GPU memory usage also looks fine, and the error still occurs when sub-sampling the input point cloud to very small sizes, so GPU memory overflow should not be an issue.
- I also checked whether there are nan/inf values in the input features to the network, and that also does not seem to be the case.
- Interestingly, I managed to run the entire training on the SemanticKITTI dataset using the default config.
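For reference, the checks on the raw data were roughly along the following lines. This is only a minimal sketch: coords and feats stand in for whatever arrays the dataset returns, and the helper is not part of my actual training code.

import numpy as np


def is_valid_sample(coords: np.ndarray, feats: np.ndarray) -> bool:
    # Reject empty point clouds.
    if coords.shape[0] == 0:
        return False
    # Reject samples containing nan/inf coordinates or features.
    if not (np.isfinite(coords).all() and np.isfinite(feats).all()):
        return False
    return True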
I am running torchsparse 2.1.0+torch20cu117.
Do you have any ideas of what could cause the problem?
Could you please provide a short code snippet that could reproduce this error? Thanks!
Hi, unfortunately I haven't been able to put together a minimal reproducible example, since the error seems to happen randomly. I have, however, tested it with different GPUs (all 2080 Tis) and the error still occurs.
I have also rerun the training using torchsparse 1.4.0 and had no problems with the same data. I am not sure whether this is related to the way I installed torchsparse (#228) or to something else.
I encountered the same error message, and here is a code snippet:
import numpy as np
import torch
from torch import nn

from torchsparse import SparseTensor
from torchsparse.backbones import SparseResNet21D, SparseResUNet42
from torchsparse.utils.quantize import sparse_quantize


@torch.no_grad()
def main() -> None:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    from torchsparse.nn import functional as F
    F.set_kmap_mode("hashmap")

    for backbone in [SparseResNet21D, SparseResUNet42]:
        print(f"{backbone.__name__}:")
        model: nn.Module = backbone(in_channels=4, width_multiplier=1.0)
        model = model.to(device).eval()

        # generate data
        input_size, voxel_size = 10000, 0.2
        size = 10
        inputs = np.random.uniform(-size, size, size=(input_size, 4))
        pcs, feats = inputs[:, :3], inputs
        pcs -= np.min(pcs, axis=0, keepdims=True)
        pcs, indices = sparse_quantize(pcs, voxel_size, return_index=True)
        coords = np.zeros((pcs.shape[0], 4))
        coords[:, 1:4] = pcs[:, :3]
        coords[:, 0] = 0
        coords = torch.as_tensor(coords, dtype=torch.int)
        feats = torch.as_tensor(feats[indices], dtype=torch.float)
        spatial_range = (1, 2 * size, 2 * size, 2 * size)
        input = SparseTensor(coords=coords, feats=feats, spatial_range=spatial_range).to(device)

        # forward
        outputs = model(input)

        # print feature shapes
        for k, output in enumerate(outputs):
            print(f"output[{k}].F.shape = {output.feats.shape}")
            s = output.dense()
            print(s.shape)
            del s


if __name__ == "__main__":
    main()
It seems that the call to output.dense() causes the problem.
Outputs with error message:
SparseResNet21D:
output[0].F.shape = torch.Size([9959, 16])
torch.Size([1, 20, 20, 20, 16])
output[1].F.shape = torch.Size([168, 32])
torch.Size([1, 20, 20, 20, 32])
output[2].F.shape = torch.Size([111, 64])
torch.Size([1, 20, 20, 20, 64])
output[3].F.shape = torch.Size([24, 128])
torch.Size([1, 20, 20, 20, 128])
output[4].F.shape = torch.Size([16, 128])
torch.Size([1, 20, 20, 20, 128])
SparseResUNet42:
Traceback (most recent call last):
File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/*****/projects/*******/agents/****/test_torchsparse.py", line 20, in main
model = model.to(device).eval()
File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 899, in to
return self._apply(convert)
File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 570, in _apply
module._apply(fn)
File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 570, in _apply
module._apply(fn)
File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 593, in _apply
param_applied = fn(param)
File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 897, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
python-BaseException
Process finished with exit code 1
Package versions:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
$ python -c "import torch; print(torch.version.cuda)"
11.1
$ pip list | grep torchsparse
torchsparse 2.1.0+torch110cu111
$ python -c "import torch; print(torch.__version__)"
1.10.1+cu111
Thank you!
@ys-2020, could you please take a look at this issue when you have time? Thanks!
I also ran into this problem.
Under torchsparse 2.1, the CPU utilization is way too high, close to 100%, which doesn't seem normal (training spvnas).
Hi all, I started from the code provided by @ZXP-S-works. It seems that the error was caused by the wrong initialization of spatial_range when initializing the SparseTensor. Since the spatial_range is too small, TorchSparse cannot do the .dense() operation correctly. The modified code is as follows.
import numpy as np
import torch
from torch import nn

from torchsparse import SparseTensor
from torchsparse.backbones import SparseResNet21D, SparseResUNet42
from torchsparse.utils.quantize import sparse_quantize


@torch.no_grad()
def main() -> None:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    from torchsparse.nn import functional as F
    F.set_kmap_mode("hashmap")

    for backbone in [SparseResNet21D, SparseResUNet42]:
        print(f"{backbone.__name__}:")
        model: nn.Module = backbone(in_channels=4, width_multiplier=1.0)
        model = model.to(device).eval()

        # generate data
        input_size, voxel_size = 10000, 0.2
        size = 10
        inputs = np.random.uniform(-size, size, size=(input_size, 4))
        pcs, feats = inputs[:, :3], inputs
        pcs -= np.min(pcs, axis=0, keepdims=True)
        pcs, indices = sparse_quantize(pcs, voxel_size, return_index=True)
        coords = np.zeros((pcs.shape[0], 4))
        coords[:, 1:4] = pcs[:, :3]
        coords[:, 0] = 0
        coords = torch.as_tensor(coords, dtype=torch.int)
        feats = torch.as_tensor(feats[indices], dtype=torch.float)
        # derive spatial_range from the largest quantized coordinate in each
        # dimension (plus one), instead of hard-coding a too-small volume
        coords_range, _ = torch.max(coords, dim=0)
        spatial_range = (coords_range + 1)
        input = SparseTensor(coords=coords, feats=feats, spatial_range=spatial_range).to(device)

        # forward
        outputs = model(input)

        # print feature shapes
        for k, output in enumerate(outputs):
            print(f"output[{k}].F.shape = {output.feats.shape}")
            s = output.dense()
            print(s.shape)
            del s


if __name__ == "__main__":
    main()
Hi @ys-2020,
For torchsparse 2.1, when I train spvnas, the CPU utilization of all cores is close to 100% no matter what num_workers is. I removed the model training part and kept only the data-loading loop (the dataloader), and I still have this problem.
However, in version 1.4 or 2.0 there is no such problem.
I've trained on both my local machine and a server and see the same situation.
Also, after installing 2.1, uninstalling it, and reinstalling 2.0, the problem still occurs.
Looking forward to your testing and reply!
Hi @ybc-ybc, could you please provide a code snippet for reproduction? Thank you.
Thank you for your prompt reply!
The code is just spvnas: https://github.com/mit-han-lab/spvnas/tree/dev/torchsparsepp_backend
I train it on SemanticKITTI on a single GPU: python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml
The GPU works fine, and I did not modify the code (spvnas).
I see. I'll take a look.
Did you reproduce the problem?
No, I didn't observe the same situation on my server. I think this problem might be caused by the dataloader library rather than by torchsparse. I would suggest checking the version of your dataloader first.
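For example, one quick way to snapshot the relevant package versions before and after installing torchsparse is a small script like the one below (a minimal sketch; the package list is just illustrative):

import importlib.metadata as md

# Compare this output in environments with and without torchsparse 2.1
# installed to spot dependencies that were upgraded or downgraded.
for pkg in ["torch", "torchpack", "numpy", "torchsparse"]:
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")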
I rented a new machine on AutoDL with the following environment: PyTorch 2.0.0, Python 3.8, Ubuntu 20.04, CUDA 11.8.
After installing torchsparse I didn't change anything else in the environment, then ran spvnas with batch_size=4 and num_workers=4:
python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml --distributed False
This CPU has 22 cores, and the problem is still there.
Did you find which part leads to the high CPU utilization on your machine? I also noticed a similar situation in a new environment, and it seems that it indeed happens before the sparse model executes. Therefore, I think this issue is not caused by TorchSparse itself; instead, it is possible that the dependency upgrades performed when installing TorchSparse cause the problem. Also, I ran the spvnas training process with TorchSparse v2.1 smoothly. Where does the illegal memory access occur?
You're right. With torchsparse 2.1 installed I still get this problem even when I run other algorithms that don't rely on v2.1! So it seems that the torchsparse 2.1 installation changed some other library in the environment; I am not sure which one yet. I hope this can be fixed soon!
The illegal memory access may be related to the hashmap_on_the_fly kmap mode.
I would suggest using the "hashmap" mode if that is the case.
Since I did not encounter the same illegal memory access when running the spvnas training script, I am not sure about the cause of the problem. Judging from the error message and the description provided above, I guess this error might be caused by unexpected inputs to the function build_kmap_implicit_GEMM_hashmap_on_the_fly(). If the problem still exists, I suggest checking whether an empty coordinate tensor is ever passed to this function, especially considering that the model trains correctly for at least several epochs before crashing.
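For anyone hitting the same crash, here is a minimal sketch of the two suggestions above: forcing the precomputed "hashmap" kernel map and guarding against empty coordinate tensors before each forward pass. It only relies on the torchsparse 2.1 API already used in this thread (set_kmap_mode, SparseTensor with .C and .feats); the check_sparse_input helper and where it is called are illustrative.

import torch
from torchsparse import SparseTensor
from torchsparse.nn import functional as F

# Use the precomputed hashmap kernel map instead of "hashmap_on_the_fly".
F.set_kmap_mode("hashmap")


def check_sparse_input(x: SparseTensor) -> None:
    # Illustrative guard to run right before model(x): an empty coordinate
    # tensor would otherwise reach the kernel-map construction with zero points.
    if x.C.numel() == 0:
        raise ValueError("SparseTensor has no voxels; skip this sample.")
    if not torch.isfinite(x.feats).all():
        raise ValueError("SparseTensor features contain nan/inf values.")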
Did you find out what's causing the high CPU utilization?
I still hit this problem after a few training iterations, even with kmap_mode="hashmap". I have not faced this when using the same code with spconv 2.3.
Same problem; it occurs irregularly during training. For my code, the error occurs at the .cpu() call, and the code snippet for this part is as follows.
all_coord = torch.cat([a.C, b.C], dim=0)
# max value for hash
max_v = all_coord.max()
# hash value
all_coord = (all_coord[:, 0] * (max_v ** 3) + all_coord[:, 1] * (max_v ** 2)
             + all_coord[:, 2] * max_v + all_coord[:, 3])
all_coord = all_coord.cpu()
Solved it. It seems that the problem is indeed related to "hashmap_on_the_fly". Even when I set the mode to "hashmap", in some situations conv3d still runs hashmap_on_the_fly. Therefore, I modified line 47 of torchsparse/nn/functional/conv/conv.py to kmap_mode = "hashmap", which fixed the problem.
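To make that workaround concrete, the sketch below shows the idea only; it is not the actual torchsparse source, just an illustration of the one-line override described above.

# In torchsparse/nn/functional/conv/conv.py (around line 47 in 2.1, according
# to the comment above), pin the kernel-map mode so the on-the-fly path can
# never be selected, regardless of the globally configured mode:
kmap_mode = "hashmap"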