
RuntimeError: CUDA error: an illegal memory access was encountered

Open aldipiroli opened this issue 1 year ago • 20 comments

Hi, I am trying to train SPVCNN on a custom dataset. Unfortunately I keep running into this error during training:

  File "/home/username/workspace/projects/spvnas/train.py", line 117, in <module>
    main()
  File "/home/username/workspace/projects/spvnas/train.py", line 93, in main
    trainer.train_with_defaults(
  File "/opt/conda/lib/python3.10/site-packages/torchpack/train/trainer.py", line 37, in train_with_defaults
    self.train(dataflow=dataflow,
  File "/opt/conda/lib/python3.10/site-packages/torchpack/train/trainer.py", line 79, in train
    output_dict = self.run_step(feed_dict)
  File "/opt/conda/lib/python3.10/site-packages/torchpack/train/trainer.py", line 125, in run_step
    output_dict = self._run_step(feed_dict)
  File "/home/username/workspace/projects/spvnas/core/trainers.py", line 52, in _run_step
    outputs = self.model(inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/username/workspace/projects/spvnas/core/models/semantic_kitti/spvcnn.py", line 191, in forward
    x1 = self.stage1(x1)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/username/workspace/projects/spvnas/core/models/semantic_kitti/spvcnn.py", line 21, in forward
    out = self.net(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "torchsparse/nn/modules/conv.pyx", line 99, in torchsparse.nn.modules.conv.Conv3d.forward
  File "torchsparse/nn/functional/conv/conv.pyx", line 89, in torchsparse.nn.functional.conv.conv.conv3d
  File "torchsparse/nn/functional/conv/kmap/build_kmap.pyx", line 83, in torchsparse.nn.functional.conv.kmap.build_kmap.build_kernel_map
  File "torchsparse/nn/functional/conv/kmap/func/hashmap_on_the_fly.pyx", line 84, in torchsparse.nn.functional.conv.kmap.func.hashmap_on_the_fly.build_kmap_implicit_GEMM_hashmap_on_the_fly
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

  • The error occurs sporadically. Sometimes the training script runs for multiple epochs and then crashes; other times it fails after a couple of minutes.
  • I have checked the input data for errors (empty point clouds, nan/inf values), but that does not seem to be the case.
  • GPU memory usage also looks fine, and the error still occurs when sub-sampling the input point cloud to very small sizes, so GPU memory overflow can be ruled out.
  • I also checked the input features to the network for nan/inf values, and that does not seem to be the case either.
  • Interestingly, I managed to run the entire training on the SemanticKITTI dataset using the default config.

I am running torchsparse 2.1.0+torch20cu117.

Do you have any idea what could be causing the problem?
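
Side note: since CUDA errors are reported asynchronously, the Python frames in the traceback above may not correspond to the kernel that actually faulted. A minimal sketch of how to rerun with synchronous kernel launches to get a more precise location (the config argument is just a placeholder for the actual training command):

$ CUDA_LAUNCH_BLOCKING=1 python train.py <config.yaml>

With synchronous launches, the reported stack frame should point at the kernel launch that triggered the illegal access.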

aldipiroli avatar Sep 15 '23 09:09 aldipiroli

Could you please provide a short code snippet that could reproduce this error? Thanks!

zhijian-liu avatar Sep 15 '23 14:09 zhijian-liu

Hi, unfortunately I can't provide a minimal reproducible example, since the error seems to happen randomly. I have, however, tested it with different GPUs (all 2080 Tis) and the error still occurs.

I have also rerun the training using torchsparse 1.4.0 with the same data and had no problems. I am not sure whether this is related to the way I installed torchsparse (#228) or to something else.

aldipiroli avatar Sep 18 '23 07:09 aldipiroli

I encountered the same error message, and here is a code snippet:

import numpy as np
import torch
from torch import nn

from torchsparse import SparseTensor
from torchsparse.backbones import SparseResNet21D, SparseResUNet42
from torchsparse.utils.quantize import sparse_quantize


@torch.no_grad()
def main() -> None:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    from torchsparse.nn import functional as F

    F.set_kmap_mode("hashmap")

    for backbone in [SparseResNet21D, SparseResUNet42]:
        print(f"{backbone.__name__}:")
        model: nn.Module = backbone(in_channels=4, width_multiplier=1.0)
        model = model.to(device).eval()

        # generate data
        input_size, voxel_size = 10000, 0.2
        size = 10
        inputs = np.random.uniform(-size, size, size=(input_size, 4))
        pcs, feats = inputs[:, :3], inputs
        pcs -= np.min(pcs, axis=0, keepdims=True)
        pcs, indices = sparse_quantize(pcs, voxel_size, return_index=True)
        coords = np.zeros((pcs.shape[0], 4))
        coords[:, 1:4] = pcs[:, :3]
        coords[:, 0] = 0
        coords = torch.as_tensor(coords, dtype=torch.int)
        feats = torch.as_tensor(feats[indices], dtype=torch.float)
        spatial_range = (1, 2 * size, 2 * size, 2 * size)
        input = SparseTensor(coords=coords, feats=feats, spatial_range=spatial_range).to(device)

        # forward
        outputs = model(input)

        # print feature shapes
        for k, output in enumerate(outputs):
            print(f"output[{k}].F.shape = {output.feats.shape}")
            s = output.dense()
            print(s.shape)
            del s


if __name__ == "__main__":
    main()

It seems that the output.dense() call causes this problem.

Outputs with error message:

    
SparseResNet21D:
output[0].F.shape = torch.Size([9959, 16])
torch.Size([1, 20, 20, 20, 16])
output[1].F.shape = torch.Size([168, 32])
torch.Size([1, 20, 20, 20, 32])
output[2].F.shape = torch.Size([111, 64])
torch.Size([1, 20, 20, 20, 64])
output[3].F.shape = torch.Size([24, 128])
torch.Size([1, 20, 20, 20, 128])
output[4].F.shape = torch.Size([16, 128])
torch.Size([1, 20, 20, 20, 128])
SparseResUNet42:
Traceback (most recent call last):
  File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/*****/projects/*******/agents/****/test_torchsparse.py", line 20, in main
    model = model.to(device).eval()
  File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/home/*****/anaconda3/envs/****/lib/python3.8/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
python-BaseException

Process finished with exit code 1

Package versions:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

$ python -c "import torch; print(torch.version.cuda)"
11.1

$ pip list | grep torchsparse
torchsparse               2.1.0+torch110cu111

$ python -c "import torch; print(torch.__version__)"
1.10.1+cu111

Thank you!

ZXP-S-works avatar Nov 09 '23 18:11 ZXP-S-works

@ys-2020, could you please take a look at this issue when you have time? Thanks!

zhijian-liu avatar Dec 11 '23 04:12 zhijian-liu

@ys-2020, could you please take a look at this issue when you have time? Thanks!

I am also running into this problem.

Under torchsparse 2.1, the CPU utilization is far too high, close to 100%. This doesn't seem normal. (Training spvnas.)

ybc-ybc avatar Dec 29 '23 13:12 ybc-ybc

Hi all, I started from the code provided by @ZXP-S-works. It seems that the error was caused by an incorrect spatial_range when constructing the SparseTensor: after shifting and quantization, the coordinates can reach roughly 2 * size / voxel_size = 100 in each dimension, while the declared spatial_range is only (1, 20, 20, 20). Since the spatial_range is too small, TorchSparse cannot perform the .dense() operation correctly. The modified code is as follows.

import numpy as np
import torch
from torch import nn

from torchsparse import SparseTensor
from torchsparse.backbones import SparseResNet21D, SparseResUNet42
from torchsparse.utils.quantize import sparse_quantize


@torch.no_grad()
def main() -> None:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    from torchsparse.nn import functional as F

    F.set_kmap_mode("hashmap")

    for backbone in [SparseResNet21D, SparseResUNet42]:
        print(f"{backbone.__name__}:")
        model: nn.Module = backbone(in_channels=4, width_multiplier=1.0)
        model = model.to(device).eval()

        # generate data
        input_size, voxel_size = 10000, 0.2
        size = 10
        inputs = np.random.uniform(-size, size, size=(input_size, 4))
        pcs, feats = inputs[:, :3], inputs
        pcs -= np.min(pcs, axis=0, keepdims=True)
        pcs, indices = sparse_quantize(pcs, voxel_size, return_index=True)
        coords = np.zeros((pcs.shape[0], 4))
        coords[:, 1:4] = pcs[:, :3]
        coords[:, 0] = 0
        coords = torch.as_tensor(coords, dtype=torch.int)
        feats = torch.as_tensor(feats[indices], dtype=torch.float)
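        # derive spatial_range from the data: one more than the largest coordinate
        # in each dimension, so every voxel index fits when .dense() is called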
        coords_range, _ = torch.max(coords, dim=0)
        spatial_range = (coords_range + 1)
        input = SparseTensor(coords=coords, feats=feats, spatial_range=spatial_range).to(device)

        # forward
        outputs = model(input)

        # print feature shapes
        for k, output in enumerate(outputs):
            print(f"output[{k}].F.shape = {output.feats.shape}")
            s = output.dense()
            print(s.shape)
            del s


if __name__ == "__main__":
    main()
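
As a quick, illustrative sanity check (check_spatial_range is a hypothetical helper, not part of TorchSparse), one could verify right before constructing the SparseTensor that every coordinate fits inside the declared spatial_range:

import torch

def check_spatial_range(coords: torch.Tensor, spatial_range) -> None:
    # coords: (N, 4) integer tensor laid out as (batch, x, y, z), as in the snippet above;
    # spatial_range: the per-dimension sizes passed to SparseTensor
    bounds = torch.as_tensor(spatial_range, device=coords.device)
    assert coords.shape[1] == bounds.numel(), "coords / spatial_range dimensionality mismatch"
    assert bool((coords >= 0).all()), "negative coordinates are out of bounds"
    assert bool((coords < bounds).all()), "spatial_range is too small for these coordinates"

For example, calling check_spatial_range(coords, spatial_range) right before SparseTensor(...) would have caught the (1, 20, 20, 20) case above immediately.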

ys-2020 avatar Dec 29 '23 14:12 ys-2020

Hi @ys-2020 ,

With torchsparse 2.1, when I train spvnas, the CPU utilization of all cores is close to 100% no matter what num_workers is. I disabled the model training part and kept only the data-loading loop (dataloader), and the problem is still there.

However, with version 1.4 or 2.0 there is no such problem.

I've trained on both my local machine and a server, and the situation is the same.

The problem also persists after installing 2.1, uninstalling it, and then reinstalling 2.0.

Looking forward to your test and reply!

ybc-ybc avatar Dec 30 '23 03:12 ybc-ybc

Hi @ybc-ybc , could you please provide a code snippet for reproduction? Thank you.

ys-2020 avatar Dec 30 '23 03:12 ys-2020

Hi @ybc-ybc , could you please provide a code snippet for reproduction? Thank you.

Thank you for your prompt reply!

The code is just spvnas: https://github.com/mit-han-lab/spvnas/tree/dev/torchsparsepp_backend

I train it on SemanticKITTI on a single GPU: python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml

The GPU works fine, and I did not modify the (spvnas) code.

[screenshot: 2023-12-30 12-28-38]

ybc-ybc avatar Dec 30 '23 04:12 ybc-ybc

I see. I'll take a look.

ys-2020 avatar Dec 30 '23 04:12 ys-2020

I see. I'll take a look.

Did you reproduce the problem?

ybc-ybc avatar Dec 31 '23 05:12 ybc-ybc

No. I didn't observe the same situation on my server. I think this problem might be caused by the dataloader library rather than torchsparse. I would suggest you check the version of your dataloader first.

ys-2020 avatar Dec 31 '23 06:12 ys-2020

No. I didn't observe the same situation on my server. I think this problem might be caused by the dataloader library rather than torchsparse. I would suggest you check the version of your dataloader first.

I rented a new machine on AutoDL with an environment of PyTorch 2.0.0, Python 3.8, Ubuntu 20.04, and CUDA 11.8.

After installing torchsparse I didn't change anything else in the environment, then ran spvnas with batch size 4 and num_workers=4:

python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml --distributed False

This CPU has 22 cores, and the problem is still there: [screenshot: Snipaste_2024-01-01_11-11-02]

ybc-ybc avatar Jan 01 '24 03:01 ybc-ybc

Did you find which part leads to the high CPU utilization on your machine? I also noticed a similar situation in a new environment, and it seems that this indeed happens before the sparse model execution. Therefore, I think this issue is not caused by TorchSparse itself; it is possible that the dependency updates performed when installing TorchSparse cause this problem. Also, I ran the spvnas training with TorchSparse v2.1 smoothly. Where does the illegal memory access occur?
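
A rough way to check that hypothesis, assuming a pip-based environment, is to snapshot the package list before and after the torchsparse installation and diff the two:

$ pip freeze > before.txt
$ # (re)install torchsparse 2.1 here, using the same method as before
$ pip freeze > after.txt
$ diff before.txt after.txt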

ys-2020 avatar Jan 02 '24 12:01 ys-2020

Did you find which part leads to the high CPU utilization on your machine? I also noticed a similar situation in a new environment, and it seems that this indeed happens before the sparse model execution. Therefore, I think this issue is not caused by TorchSparse itself; it is possible that the dependency updates performed when installing TorchSparse cause this problem. Also, I ran the spvnas training with TorchSparse v2.1 smoothly. Where does the illegal memory access occur?

You're right. With torchsparse 2.1 installed, even when I run other algorithms (that don't rely on v2.1), I still get this problem! So it seems the torchsparse 2.1 installation changed something in the original environment, though I am not sure where yet. I hope this problem can be fixed soon!

The illegal memory access may be related to the hashmap_on_the_fly setting.

ybc-ybc avatar Jan 02 '24 14:01 ybc-ybc

I would suggest you use "hashmap" mode if that is the case.

Since I did not encounter the same illegal memory access problem when running the spvnas training script, I am not sure about the cause. Judging from the error message and description above, this error might be caused by unexpected inputs to the function build_kmap_implicit_GEMM_hashmap_on_the_fly(). If the problem persists, I suggest checking whether an empty coordinate tensor is ever passed to this function, especially since the training can run correctly for multiple epochs before failing.
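
For reference, switching the kernel-map mode globally uses the same call as in the reproduction snippet earlier in this thread; a minimal sketch:

from torchsparse.nn import functional as F

# select the "hashmap" kernel-map mode globally instead of "hashmap_on_the_fly"
F.set_kmap_mode("hashmap")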

ys-2020 avatar Jan 02 '24 14:01 ys-2020

I would suggest you use "hashmap" mode if that is the case.

Since I did not encounter the same illegal memory access problem when running the spvnas training script, I am not sure about the cause. Judging from the error message and description above, this error might be caused by unexpected inputs to the function build_kmap_implicit_GEMM_hashmap_on_the_fly(). If the problem persists, I suggest checking whether an empty coordinate tensor is ever passed to this function, especially since the training can run correctly for multiple epochs before failing.

Did you find out what's causing the high CPU utilization?

ybc-ybc avatar Jan 04 '24 06:01 ybc-ybc

I still hit this problem after a few training iterations, even with kmap_mode="hashmap". I have not faced this when using the same code with spconv 2.3.

hontrn9122 avatar Mar 04 '24 07:03 hontrn9122

Same problem; it occurs irregularly during training. In my code, the error is reported at the .cpu() call, and the relevant code snippet is as follows.

    all_coord = torch.cat([a.C, b.C], dim=0)
    # max value for hash
    max_v = all_coord.max()
    # hash value
    all_coord = (all_coord[:, 0] * (max_v ** 3) + all_coord[:, 1] * (max_v ** 2)
                 + all_coord[:, 2] * max_v + all_coord[:, 3])
    all_coord = all_coord.cpu()

dingfengshi avatar May 10 '24 08:05 dingfengshi

Same problem; it occurs irregularly during training. In my code, the error is reported at the .cpu() call, and the relevant code snippet is as follows.

    all_coord = torch.cat([a.C, b.C], dim=0)
    # max value for hash
    max_v = all_coord.max()
    # hash value
    all_coord = (all_coord[:, 0] * (max_v ** 3) + all_coord[:, 1] * (max_v ** 2)
                 + all_coord[:, 2] * max_v + all_coord[:, 3])
    all_coord = all_coord.cpu()

Solved it. The problem does indeed seem to be related to "hashmap_on_the_fly". Even when I set the mode to "hashmap", in some situations conv3d still runs hashmap_on_the_fly. Therefore, I modified L47 in torchsparse/nn/functional/conv/conv.py to kmap_mode = "hashmap", which fixed the problem.

dingfengshi avatar May 15 '24 01:05 dingfengshi