RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDeviceFunction: invalid device function

Open jh-chung1 opened this issue 2 years ago • 2 comments

Hi, I get the two error messages below when I use a Tesla V100 as my CUDA device. They do not occur when I use an NVIDIA A100. Since spconv has higher performance with the V100, I would like to ask how to resolve this. I'd appreciate any comments or experience you could share.

SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.

RuntimeError Traceback (most recent call last)
/tmp/ipykernel_104761/1157574869.py in
     13 optimizer.zero_grad()
     14 with amp_ctx:
---> 15 pred_perm = model(img)
     16 loss = criterior(pred_perm, true_perm)
     17 scale = 1.0

/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887 result = self._slow_forward(*input, **kwargs)
    888 else:
--> 889 result = self.forward(*input, **kwargs)
    890 for hook in itertools.chain(
    891 _global_forward_hooks.values(),

/tmp/ipykernel_104761/2908154358.py in forward(self, x)
     63 x_sp = spconv.SparseConvTensor(features, indices, spatial_shape, batch)
     64 # create SparseConvTensor manually: see SparseConvTensor.from_dense
---> 65 x = self.net(x_sp)
     66 x = torch.flatten(x, 1)
     67 x = self.regressor(x)

/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887 result = self._slow_forward(*input, **kwargs)
    888 else:
--> 889 result = self.forward(*input, **kwargs)
    890 for hook in itertools.chain(
    891 _global_forward_hooks.values(),

/scratch/users/jhchung1/sparse_cnn/spconv/spconv/pytorch/modules.py in forward(self, input)
    135 assert isinstance(input, spconv.SparseConvTensor)
    136 # self._sparity_dict[k] = input.sparity
--> 137 input = module(input)
    138 else:
    139 if isinstance(input, spconv.SparseConvTensor):

/share/software/user/open/py-pytorch/1.8.1_py39/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887 result = self._slow_forward(*input, **kwargs)
    888 else:
--> 889 result = self.forward(*input, **kwargs)
    890 for hook in itertools.chain(
    891 _global_forward_hooks.values(),

/scratch/users/jhchung1/sparse_cnn/spconv/spconv/pytorch/conv.py in forward(self, input)
    402 print(msg, file=sys.stderr)
    403 spconv_save_debug_data(indices)
--> 404 raise e
    405
    406 outids = res[0]

/scratch/users/jhchung1/sparse_cnn/spconv/spconv/pytorch/conv.py in forward(self, input)
    379 # because it may be inversed.
    380 try:
--> 381 res = ops.get_indice_pairs_implicit_gemm(
    382 indices,
    383 batch_size,

/scratch/users/jhchung1/sparse_cnn/spconv/spconv/pytorch/ops.py in get_indice_pairs_implicit_gemm(indices, batch_size, spatial_shape, algo, ksize, stride, padding, dilation, out_padding, subm, transpose, is_train, alloc, timer)
    355 # pytorch binary (c++).
    356 # f**k thrust
--> 357 SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
    358 alloc.alloc,
    359 mask_argsort_tv[j], stream)

RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDeviceFunction: invalid device function

In [14]:

jh-chung1, Apr 03 '22 22:04

me too.

JaydencoolCC, May 19 '22 06:05

Hi @sci-study-storage @JaydencoolCC, have you been able to debug the error?

tarmas99, Jul 12 '22 09:07

What's the solution to this issue? I ran into the same problem.

InfiniteSamele, Sep 26 '22 02:09

@InfiniteSamele You can try spconv 2.2. This bug is related to Thrust, and I can't reproduce it in my environment. I need a minimal reproduction script to debug it.

FindDefinition, Sep 26 '22 06:09

> @InfiniteSamele You can try spconv 2.2. This bug is related to Thrust, and I can't reproduce it in my environment. I need a minimal reproduction script to debug it.

Well, the problem occurs just as this issue describes: my project works on my other servers, but when I run it in an environment with a Tesla V100 GPU, it shows this error. I suspect some settings are different and the alloc can't allocate on the device. @FindDefinition

InfiniteSamele, Sep 26 '22 07:09

@InfiniteSamele I can't reproduce this problem on a V100 32G/16G with spconv-cu102/cu114/cu117 (spconv 2.2). Test command: python -m spconv.benchmark bench_basic f16. Test NVIDIA driver: 515. That's why I need a minimal reproduction script.
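
For anyone preparing the reproduction script requested above, an environment report pasted next to it may help narrow things down. The CUDA runtime documents cudaErrorInvalidDeviceFunction as meaning the requested device function does not exist or was not compiled for the proper device architecture, so the GPU's compute capability (7.0 on V100, 8.0 on A100) and the CUDA version PyTorch was built with are worth recording. The sketch below is only an illustration: the helper name collect_env_report is hypothetical and not part of spconv, and it assumes nothing beyond standard PyTorch calls and the nvidia-smi tool, both of which it skips gracefully when unavailable.

```python
import importlib
import platform
import shutil
import subprocess


def collect_env_report():
    """Collect version and GPU details to paste alongside a reproduce script.

    Hypothetical helper for illustration; not part of spconv.
    """
    report = {"python": platform.python_version()}

    # Package versions; an import failure is recorded instead of raised.
    for name in ("torch", "spconv"):
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except Exception as exc:
            report[name] = f"import failed: {type(exc).__name__}"

    # CUDA build and device details as seen by PyTorch, when available.
    try:
        import torch

        report["torch_cuda_build"] = torch.version.cuda
        if torch.cuda.is_available():
            report["gpu_name"] = torch.cuda.get_device_name(0)
            report["compute_capability"] = torch.cuda.get_device_capability(0)
    except Exception:
        pass  # torch missing or broken; already noted above

    # Driver version from nvidia-smi, if the tool is on PATH.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()

    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running this on both the V100 machine and the A100 machine and comparing the two reports would show whether the spconv build, the CUDA version, or the driver differs between them.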

FindDefinition avatar Sep 26 '22 07:09 FindDefinition