spconv
RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDeviceFunction: invalid device function
Hi, I encountered the two error messages below when I use a Tesla V100 GPU; they don't happen when I use an NVIDIA A100. Since spconv gives me higher performance on the V100, I would like to ask how to resolve this situation. I'd appreciate it if you could share any comments or experience.
SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_104761/1157574869.py in <module>

/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

/tmp/ipykernel_104761/2908154358.py in forward(self, x)
     63         x_sp = spconv.SparseConvTensor(features, indices, spatial_shape, batch)
     64         # create SparseConvTensor manually: see SparseConvTensor.from_dense
---> 65         x = self.net(x_sp)
     66         x = torch.flatten(x, 1)
     67         x = self.regressor(x)

/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

/scratch/users/jhchung1/sparse_cnn/spconv/spconv/pytorch/modules.py in forward(self, input)
    135                 assert isinstance(input, spconv.SparseConvTensor)
    136                 # self._sparity_dict[k] = input.sparity
--> 137                 input = module(input)
    138             else:
    139                 if isinstance(input, spconv.SparseConvTensor):

/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

/scratch/users/jhchung1/sparse_cnn/spconv/spconv/pytorch/conv.py in forward(self, input)
    402                 print(msg, file=sys.stderr)
    403                 spconv_save_debug_data(indices)
--> 404                 raise e
    405
    406         outids = res[0]

/scratch/users/jhchung1/sparse_cnn/spconv/spconv/pytorch/conv.py in forward(self, input)
    379             # because it may be inversed.
    380             try:
--> 381                 res = ops.get_indice_pairs_implicit_gemm(
    382                     indices,
    383                     batch_size,

/scratch/users/jhchung1/sparse_cnn/spconv/spconv/pytorch/ops.py in get_indice_pairs_implicit_gemm(indices, batch_size, spatial_shape, algo, ksize, stride, padding, dilation, out_padding, subm, transpose, is_train, alloc, timer)
    355         # pytorch binary (c++).
    356         # f**k thrust
--> 357         SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
    358                                            alloc.alloc,
    359                                            mask_argsort_tv[j], stream)

RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDeviceFunction: invalid device function
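As the SPCONV_DEBUG_SAVE_PATH message above suggests, spconv can dump debug data for attachment to an issue if that environment variable is set before the failing op runs. A minimal sketch, assuming any writable directory works (the path here is arbitrary):

```python
import os

# Hypothetical debug directory; any writable path should work.
DEBUG_DIR = "/tmp/spconv_debug"
os.makedirs(DEBUG_DIR, exist_ok=True)

# Must be set before the failing spconv op executes so that
# spconv has somewhere to write its debug data.
os.environ["SPCONV_DEBUG_SAVE_PATH"] = DEBUG_DIR
```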
Me too.
Hi @sci-study-storage @JaydencoolCC, have you been able to debug the error?
What's the solution to this issue? I've run into the same problem.
@InfiniteSamele You can try spconv 2.2. This bug is related to thrust, and I can't reproduce it in my environment, so I need a minimal reproduction script to debug it.
Well, this problem happened exactly as this issue describes. My project works on my other servers, but when I run it in an environment with a Tesla V100 GPU, it shows this error. I think it may be caused by some different settings, so that the allocator can't allocate on the device. @FindDefinition
@InfiniteSamele I can't reproduce this problem on a V100 32G/16G with spconv-cu102/cu114/cu117 (spconv 2.2). Test command: python -m spconv.benchmark bench_basic f16, tested NVIDIA driver: 515. That's why I need a minimal reproduction script.
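For anyone trying to provide one: a minimal reproduction script might look like the sketch below. It builds a small SparseSequential network and runs one forward pass on random voxels; the spatial shape, channel counts, and point count are illustrative assumptions, not values from the original report. On an affected V100 setup the forward pass should hit the radix_sort error.

```python
import torch
import spconv.pytorch as spconv


def main():
    device = torch.device("cuda")
    batch_size = 1
    spatial_shape = [41, 400, 352]   # assumed grid size
    num_points = 2000                # assumed number of active voxels

    # Random active voxel coordinates: (batch_idx, z, y, x), int32.
    coords = torch.stack([
        torch.zeros(num_points, dtype=torch.int32),
        torch.randint(0, spatial_shape[0], (num_points,), dtype=torch.int32),
        torch.randint(0, spatial_shape[1], (num_points,), dtype=torch.int32),
        torch.randint(0, spatial_shape[2], (num_points,), dtype=torch.int32),
    ], dim=1)
    coords = torch.unique(coords, dim=0).to(device)  # drop duplicate voxels
    features = torch.randn(coords.shape[0], 16, device=device)

    x = spconv.SparseConvTensor(features, coords, spatial_shape, batch_size)

    net = spconv.SparseSequential(
        spconv.SubMConv3d(16, 32, 3, padding=1, indice_key="subm0"),
        torch.nn.ReLU(),
        spconv.SparseConv3d(32, 64, 3, stride=2, padding=1),
    ).to(device)

    out = net(x)  # on an affected V100 this should raise the radix_sort error
    print(out.features.shape)


if __name__ == "__main__":
    main()
```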