
Data exceed int32 range in backward

dingfengshi opened this issue on Nov 08, 2023 · 2 comments

Hi author,

I ran into a problem: the forward pass works fine, but during the backward pass I get a runtime error:

[Exception|indice_conv_backward]feat=torch.Size([3921805, 64]),w=torch.Size([192, 3, 3, 3, 64]),pair=torch.Size([2, 27, 3921805]),pairnum=tensor([3757581, 3840413, 3801984, 3836597, 3921543, 3882540, 3797714, 3881843,
        3843733, 3757821, 3840671, 3802238, 3836841, 3921805, 3882798, 3797958,
        3882105, 3843991, 3756343, 3839160, 3800743, 3835330, 3920260, 3881270,
        3796454, 3880567, 3842469], device='cuda:0', dtype=torch.int32),do=torch.Size([7135456, 192])


Traceback (most recent call last):
  File "train.py", line 271, in <module>
    main(args)
  File "train.py", line 172, in main
    train_one_epoch(
  File "/data/utils/train_utils.py", line 346, in train_one_epoch
    amp_scaler.scale(losses['final_loss']).backward()
  File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/autograd/function.py", line 399, in wrapper
    outputs = fn(ctx, *args)
  File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 135, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/spconv/pytorch/functional.py", line 115, in backward
    raise e
  File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/spconv/pytorch/functional.py", line 101, in backward
    input_bp, filters_bp = ops.indice_conv_backward(features,
  File "/usr/local/lib/miniconda3/envs/cloud-ai-lab/lib/python3.8/site-packages/spconv/pytorch/ops.py", line 1142, in indice_conv_backward
    ConvGemmOps.indice_conv_backward(alloc, ext_mm, GEMM_CPP,
RuntimeError: /io/build/temp.linux-x86_64-cpython-38/spconv/build/core_cc/src/cumm/gemm/main/GemmMainUnitTest/GemmMainUnitTest_matmul_split_Turing_f16f16f16_1.cu(126)
int64_t(a.dim(0)) * int64_t(a.dim(1)) * tv::bit_size(algo_desp.dtype_a) / 8 < int_max assert faild. your data exceed int32 range. this will be fixed in cumm + nvrtc (spconv 2.2/2.3).

If I downsample the data, the code runs. However, I then cannot increase the batch size, so the GPU memory is not fully utilized.

My spconv version is spconv-cu113, and spconv-cu117 fails as well. Could you please give me some advice on how to fix this? Thank you!
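
For reference, here is a rough back-of-the-envelope check (my own sketch, not code from spconv/cumm) of the assertion in the error message. If I read it correctly, it requires that one GEMM operand's byte size fit in int32; plugging in the shapes from the exception message, the 7135456 x 192 fp16 gradient tensor seems to be the operand that overflows:

```python
# Back-of-the-envelope check of the int32 byte-size assertion from the error
# message (my own sketch, not the actual spconv/cumm implementation).
INT32_MAX = 2**31 - 1
BYTES_FP16 = 2  # AMP runs the convolution in float16

def operand_bytes(rows, cols, bytes_per_elem=BYTES_FP16):
    """Byte size of a [rows, cols] operand and whether it breaks the int32 limit."""
    nbytes = rows * cols * bytes_per_elem
    return nbytes, nbytes >= INT32_MAX

# Shapes taken from the exception message above:
print(operand_bytes(3921805, 64))   # features:      501,991,040 bytes -> fits
print(operand_bytes(7135456, 192))  # do (out grad): 2,740,015,104 bytes -> exceeds int32
```

That would also explain why downsampling helps: it reduces the number of active sites until every operand fits under the limit again.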

dingfengshi · Nov 08 '23 09:11

Hello! I have the same problem, but in the forward pass (with gradients enabled). Were you able to resolve the issue?

anuartask · May 09 '24 16:05

Same problem! Is there any solution?

lyhsieh · Jun 23 '24 03:06