spconv
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
I tried to run example/mnist/mnist_sparse.py, but it failed with the following error:
[Exception|implicit_gemm_pair]indices=torch.Size([4656, 3]),bs=32,ss=[28, 28],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3],stride=[1, 1],padding=[0, 0],dilation=[1, 1],subm=True,transpose=False
SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.
Traceback (most recent call last):
File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/home/chenhai-fwxz/zch/spconv/example/mnist/mnist_sparse.py", line 235, in <module>
main()
File "/home/chenhai-fwxz/zch/spconv/example/mnist/mnist_sparse.py", line 226, in main
train(args, model, device, train_loader, optimizer, epoch)
File "/home/chenhai-fwxz/zch/spconv/example/mnist/mnist_sparse.py", line 75, in train
output = model(data)
File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/chenhai-fwxz/zch/spconv/example/mnist/mnist_sparse.py", line 54, in forward
x = self.net(x_sp)
File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/modules.py", line 138, in forward
input = module(input)
File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/conv.py", line 755, in forward
return self._conv_forward(self.training,
File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/conv.py", line 408, in _conv_forward
raise e
File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/conv.py", line 385, in _conv_forward
res = ops.get_indice_pairs_implicit_gemm(
File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/ops.py", line 550, in get_indice_pairs_implicit_gemm
SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Environment information:
spconv=2.23.6 # built from source
cumm=0.411
cuda=11.3
torch=1.12.1
gpu=NVIDIA TITAN X (Pascal)
script:
cd example/mnist
python mnist_sparse.py
Really hope someone can help
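The error output above says that SPCONV_DEBUG_SAVE_PATH can be specified so that debug data is saved and attached to an issue. A minimal sketch of one way to do that, assuming the variable is read from the process environment; the file name spconv_debug.pkl and the helper build_debug_env are hypothetical, and the message does not say whether a file or a directory is expected:

```python
import os
import tempfile


def build_debug_env(save_path, base_env=None):
    """Return a copy of the environment with SPCONV_DEBUG_SAVE_PATH set.

    Assumption: spconv reads this name from the environment, as the
    error message suggests. The helper itself is illustrative only.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SPCONV_DEBUG_SAVE_PATH"] = save_path
    return env


# Hypothetical location for the debug dump.
debug_path = os.path.join(tempfile.gettempdir(), "spconv_debug.pkl")
debug_env = build_debug_env(debug_path)

# The example could then be launched with this environment, e.g.:
# import subprocess, sys
# subprocess.run([sys.executable, "mnist_sparse.py"],
#                cwd="example/mnist", env=debug_env, check=False)
```

Passing a copied environment to a child process leaves the current shell and interpreter untouched, which makes it easy to rerun without the variable afterwards.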
Got it as well, did you solve it?
Traceback (most recent call last):
File "/home/aify/anaconda3/envs/VILNet/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 385, in _conv_forward
res = ops.get_indice_pairs_implicit_gemm(
File "/home/aify/anaconda3/envs/VILNet/lib/python3.9/site-packages/spconv/pytorch/ops.py", line 550, in get_indice_pairs_implicit_gemm
SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/aify/anaconda3/envs/VILNet/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 402, in _conv_forward
msg += f"indices={indices.shape},bs={batch_size},ss={spatial_shape},"
File "/home/aify/anaconda3/envs/VILNet/lib/python3.9/site-packages/torch/_tensor.py", line 872, in __format__
return self.item().__format__(format_spec)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
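The message above suggests CUDA_LAUNCH_BLOCKING=1 because CUDA errors are reported asynchronously, so the traceback may point at the wrong call. A minimal sketch of enabling it from inside a script; it assumes the variable is set before any CUDA work starts, which is simplest to guarantee by setting it before importing torch. The guarded import is only there so the snippet also runs on a machine without torch:

```python
import os

# Set before importing torch so that every kernel launch is synchronous
# and an illegal memory access is reported at the call that caused it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:  # torch not installed on this machine
    torch = None

if torch is not None and torch.cuda.is_available():
    # Print the visible devices so the traceback can be matched to a GPU.
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
```

Synchronous launches slow training down noticeably, so this is meant for reproducing the failure, not for regular runs.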
Did you solve it? I ran my code on cuda:0 and it worked well. However, when I switched to cuda:1, this problem occurred.
I have the same problem. It only works on cuda:0. It also does not work well with timm.
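Since the reports above say the code only works on cuda:0, one thing that may be worth trying is to expose only the second physical GPU to the process, so that it is addressed as cuda:0 inside the script. This is a sketch of a possible workaround, not a confirmed fix from this thread; the helper pin_physical_gpu is hypothetical, and it assumes CUDA_VISIBLE_DEVICES is set before torch initializes CUDA:

```python
import os


def pin_physical_gpu(index):
    """Expose a single physical GPU to this process.

    After this call the chosen GPU is the only visible device, so the
    script should refer to it as "cuda:0". Must run before CUDA is
    initialized (i.e. before importing torch or spconv).
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return "cuda:0"


# Physical GPU 1 becomes the process-local cuda:0.
device_name = pin_physical_gpu(1)

# Later in the script (after importing torch):
# device = torch.device(device_name)
# model = model.to(device)
```

If the error disappears with this remapping, that would point to a device-selection issue rather than a problem with the input data.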