NeRFusion

RuntimeError: CUDA error: invalid configuration argument

Open jiangxf0929 opened this issue 3 years ago • 9 comments

Hi, thanks for sharing your code. When I tested scene scene0000_01 using Scannet0000_01.ckpt, I got a CUDA error. Can you help me? Thanks.

`python train.py --dataset_name scannet --root_dir /data/scannet/scans/scene0000_01 --exp_name try_pretrain_scannnet0000_01 --val_only --ckpt_path ./Scannet0000_01.ckpt`

File "/home/NeRFusion/models/nerfusion.py", line 161, in sample_uniform_and_occupied_cells
cells += [(torch.cat([indices1, indices2]), torch.cat([coords1, coords2]))]
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

jiangxf0929 avatar Oct 25 '22 02:10 jiangxf0929

Thanks for this feedback. Can you provide more information? Are you using the provided data and checkpoint? How many computing devices are you using? Also could you try with CUDA_LAUNCH_BLOCKING=1 as suggested in the error message and post the logs here?
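A minimal sketch of one way to follow this suggestion. CUDA_LAUNCH_BLOCKING has to be in the environment before the process initializes CUDA, so the sketch passes it to the training command through a copied environment. The helper names `blocking_env` and `run_blocking` are hypothetical; the command is the one from the original report.

```python
import os
import shlex
import subprocess

def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env

# Command from the original report; adjust the paths to your setup.
CMD = shlex.split(
    "python train.py --dataset_name scannet "
    "--root_dir /data/scannet/scans/scene0000_01 "
    "--exp_name try_pretrain_scannnet0000_01 "
    "--val_only --ckpt_path ./Scannet0000_01.ckpt"
)

def run_blocking(cmd=CMD):
    """Run the command with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode

# Usage (not executed here, since it starts the validation run):
# exit_code = run_blocking()
```

With synchronous launches the traceback should point at the call that actually failed, which makes the posted logs more useful.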

jetd1 avatar Jan 07 '23 06:01 jetd1

I have the exact same issue. Yes, I followed your instructions here. (image) So I used your data and your checkpoint exactly as provided. I only use one GPU, a Tesla V100-PCIE-32GB.

weizhenFrank avatar Jan 16 '23 04:01 weizhenFrank

Here's my env info

PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (wliu25-vector2) 7.5.0
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: glibc-2.17

Python version: 3.8.15 (default, Nov 24 2022, 15:19:38) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.18.2.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 440.95.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-lightning==1.7.7
[pip3] torch==1.12.1
[pip3] torch-scatter==2.1.0
[pip3] torchaudio==0.12.1
[pip3] torchmetrics==0.11.0
[pip3] torchsparse==1.4.0
[pip3] torchvision==0.13.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.23.5 py38h14f4228_0
[conda] numpy-base 1.23.5 py38h31eccc5_0
[conda] pytorch 1.12.1 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-lightning 1.7.7 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-scatter 2.1.0 py38_torch_1.12.0_cu102 pyg
[conda] torchaudio 0.12.1 py38_cu102 pytorch
[conda] torchmetrics 0.11.0 pypi_0 pypi
[conda] torchsparse 1.4.0 pypi_0 pypi
[conda] torchvision 0.13.1 py38_cu102 pytorch

weizhenFrank avatar Jan 16 '23 04:01 weizhenFrank

The output does not change much after adding CUDA_LAUNCH_BLOCKING=1: (image)

weizhenFrank avatar Jan 16 '23 04:01 weizhenFrank

I guess it might be related to:

  1. the CUDA version (I'm on 10.2),
  2. the GPU's compute capability (mine is below 7.5).

Here's a related issue: https://github.com/pytorch/pytorch/issues/48573
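Both guesses can be checked from Python. The sketch below is only an illustration: `capability_below` and `describe_gpu` are hypothetical helper names, and the 7.5 threshold is the one guessed at in this comment, not a requirement confirmed by the NeRFusion authors. For reference, a Tesla V100 reports compute capability 7.0.

```python
def capability_below(capability, minimum=(7, 5)):
    """True if a (major, minor) compute capability is lower than `minimum`."""
    return tuple(capability) < tuple(minimum)

def describe_gpu(index=0):
    """Summarize the CUDA build and compute capability of one GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "CUDA is not available"
    major, minor = torch.cuda.get_device_capability(index)
    name = torch.cuda.get_device_name(index)
    return (
        f"{name}: compute capability {major}.{minor}, "
        f"PyTorch built with CUDA {torch.version.cuda}, "
        f"below 7.5: {capability_below((major, minor))}"
    )

print(describe_gpu())
```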

weizhenFrank avatar Jan 16 '23 04:01 weizhenFrank

Hi, is there any update on my problem? I would appreciate it a lot if you could help me.

weizhenFrank avatar Jan 19 '23 22:01 weizhenFrank

I got the same results. You could check the issue at https://github.com/kwea123/ngp_pl/issues/95, since the author uses the ngp_pl code; https://github.com/jetd1/NeRFusion/issues/8#issue-1409552026 also shows similar results with size 0. Maybe something is wrong with scene0000_01.ckpt... And G.ckpt cannot be loaded...
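One way to test the guess that the checkpoint is at fault is to look for zero-size tensors in it, since an empty tensor can lead to a kernel launched with zero blocks, which CUDA rejects as an invalid configuration. This is a rough sketch, not the authors' tooling: `find_empty_entries` and `inspect_checkpoint` are hypothetical names, and the lookup of a `"state_dict"` key assumes a PyTorch Lightning style checkpoint.

```python
def find_empty_entries(state_dict):
    """Return the names of entries whose tensor holds zero elements."""
    return [
        name
        for name, value in state_dict.items()
        if hasattr(value, "numel") and value.numel() == 0
    ]

def inspect_checkpoint(path):
    """Load a checkpoint on the CPU and list its empty tensors."""
    import torch  # imported lazily so find_empty_entries works without PyTorch

    ckpt = torch.load(path, map_location="cpu")
    # Assumption: Lightning-style checkpoints keep the weights under
    # "state_dict"; otherwise fall back to the top-level mapping.
    state = ckpt.get("state_dict", ckpt)
    return find_empty_entries(state)

# Usage (path taken from the original report):
# print(inspect_checkpoint("./Scannet0000_01.ckpt"))
```

An empty list does not rule the checkpoint out, because the occupied cells are computed at run time, but a non-empty list would point to the entries worth examining first.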

HengyiWang avatar Jan 20 '23 18:01 HengyiWang

I use cu113+torch12.0.1 and installed the dependencies, but an error occurs when importing vren:

vren.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

Can anyone tell me how to fix this?

Bin-ze avatar Mar 01 '23 10:03 Bin-ze

@jetd1 I have the exact same issue. Could you upload working pre-trained weights again?

Robot-zeg avatar Mar 29 '23 16:03 Robot-zeg