spconv
Data exceed int32 range in backward
Hi, author:
I ran into a problem: the forward pass works, but during the backward pass I get a runtime error:
[Exception|indice_conv_backward]feat=torch.Size([3921805, 64]),w=torch.Size([192, 3, 3, 3, 64]),pair=torch.Size([2, 27, 3921805]),pairnum=tensor([3757581, 3840413, 3801984, 3836597, 3921543, 3882540, 3797714, 3881843,
3843733, 3757821, 3840671, 3802238, 3836841, 3921805, 3882798, 3797958,
3882105, 3843991, 3756343, 3839160, 3800743, 3835330, 3920260, 3881270,
3796454, 3880567, 3842469], device='cuda:0', dtype=torch.int32),do=torch.Size([7135456, 192])
Traceback (most recent call last):
File "train.py", line 271, in <module>
main(args)
File "train.py", line 172, in main
train_one_epoch(
File "/data/utils/train_utils.py", line 346, in train_one_epoch
amp_scaler.scale(losses['final_loss']).backward()
File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/autograd/function.py", line 399, in wrapper
outputs = fn(ctx, *args)
File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 135, in decorate_bwd
return bwd(*args, **kwargs)
File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/spconv/pytorch/functional.py", line 115, in backward
raise e
File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/spconv/pytorch/functional.py", line 101, in backward
input_bp, filters_bp = ops.indice_conv_backward(features,
File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/spconv/pytorch/ops.py", line 1142, in indice_conv_backward
ConvGemmOps.indice_conv_backward(alloc, ext_mm, GEMM_CPP,
RuntimeError: /io/build/temp.linux-x86_64-cpython-38/spconv/build/core_cc/src/cumm/gemm/main/GemmMainUnitTest/GemmMainUnitTest_matmul_split_Turing_f16f16f16_1.cu(126)
int64_t(a.dim(0)) * int64_t(a.dim(1)) * tv::bit_size(algo_desp.dtype_a) / 8 < int_max assert faild. your data exceed int32 range. this will be fixed in cumm + nvrtc (spconv 2.2/2.3).
If I downsample the data, the code runs. However, I then cannot increase the batch size, so I cannot fully use the available memory.
I am using spconv-cu113; spconv-cu117 fails in the same way. Could you please give me some advice on how to fix this? Thank you!
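For reference, the failing assertion can be checked by hand from the shapes in the log. The sketch below is plain Python and is not spconv code: the helper names (`byte_size`, `fits_int32`, `max_rows`) are made up for illustration, and it assumes that the tensor `a` in the assertion is the output gradient `do` and that its dtype is float16 (16 bits), as the kernel name `f16f16f16` suggests.

```python
# Rough reproduction of the size check quoted in the error message:
#   a.dim(0) * a.dim(1) * bit_size(dtype) / 8 < int_max
# Helper names are hypothetical; only the arithmetic comes from the log.

INT32_MAX = 2**31 - 1  # 2147483647


def byte_size(rows: int, cols: int, bits_per_element: int) -> int:
    """Size in bytes of a dense rows x cols matrix."""
    return rows * cols * bits_per_element // 8


def fits_int32(rows: int, cols: int, bits_per_element: int) -> bool:
    """True if the matrix size in bytes is strictly below INT32_MAX."""
    return byte_size(rows, cols, bits_per_element) < INT32_MAX


def max_rows(cols: int, bits_per_element: int) -> int:
    """Largest row count that still passes the check for this width."""
    bytes_per_row = cols * bits_per_element // 8
    return (INT32_MAX - 1) // bytes_per_row


# Shapes taken from the exception message; float16 is an assumption.
shapes = {"feat": (3921805, 64), "do": (7135456, 192)}
for name, (rows, cols) in shapes.items():
    size = byte_size(rows, cols, 16)
    status = "ok" if fits_int32(rows, cols, 16) else "exceeds int32"
    print(f"{name}: {size} bytes ({status})")

print("max rows for 192 channels in float16:", max_rows(192, 16))
```

Under these assumptions `feat` is well within the limit, while `do` (7135456 x 192 half-precision values) is about 2.74 GB and fails the check. That would explain why downsampling helps: it reduces the number of active output voxels, which is the row count of `do`.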
Hello! I have the same problem but in forward mode (gradients on). Were you able to resolve the issue?
Same problem! Is there any solution?