
CUDA error

Open YuhsiHu opened this issue 2 years ago • 2 comments

Thank you for your great work! When I tried to train the network on the DTU dataset, there was a CUDA error during the training process. Could you please tell me why this happens? Thank you!

```
Found ckpts []
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
/home/hyx/anaconda3/envs/mvsnerf/lib/python3.8/site-packages/numpy/core/shape_base.py:420: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  arrays = [asanyarray(arr) for arr in arrays]
==> image down scale: 1.0
==> image down scale: 1.0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Validation sanity check: 0it [00:00, ?it/s]
/home/hyx/anaconda3/envs/mvsnerf/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check: 0%| | 0/1 [00:00<?, ?it/s]
/home/hyx/anaconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Epoch 0:  66%|▋| 19999/30194 [6:09:42<3:08:28, 1.11s/it, loss=0.00857, v_num=0, train/loss=
Saved checkpoints at runs_new/myexp/ckpts//19999.tar
Epoch 1:   0%| | 0/30194 [00:00<?, ?it/s, loss=0.00732, v_num=0, train/loss=0.00123, train/P
terminate called after throwing an instance of 'terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::CUDAErrorterminate called after throwing an instance of 'c10::CUDAErrorc10::CUDAError' ' c10::CUDAError' '
terminate called after throwing an instance of 'terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::CUDAErrorc10::CUDAErrorc10::CUDAError' ' '
  what(): CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from getDevice at ../c10/cuda/impl/CUDAGuardImpl.h:39 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7efe89e687d2 in /home/hyx/anaconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x267dc82 (0x7efedce87c82 in /home/hyx/anaconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: + 0x300568 (0x7eff3f280568 in /home/hyx/anaconda3/envs/mvsnerf/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
```
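The traceback itself suggests one low-risk first step before digging deeper: rerunning with `CUDA_LAUNCH_BLOCKING=1` so the stack trace points at the call that actually failed. A minimal sketch (only the environment-variable name comes from the log; how the training script is launched is up to you):

```python
import os

# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the error is
# raised at the kernel launch that actually failed instead of being reported
# asynchronously at some later API call. It must be set before the CUDA
# context is created, i.e. before importing torch (or exported in the shell
# before launching the training script).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Equivalently, from the shell: `CUDA_LAUNCH_BLOCKING=1 python <your training command>`.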

YuhsiHu avatar Jun 15 '22 02:06 YuhsiHu

Have you resolved this issue? I haven't run into a similar problem, so I don't have any suggestions based on this error log.

apchenstu avatar Jun 30 '22 02:06 apchenstu

No. My machine has one 3060 GPU. Training works fine, but this error occurs during fine-tuning if I set the number of epochs larger than 1.

YuhsiHu avatar Jul 28 '22 07:07 YuhsiHu
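For anyone who lands here later: the repeated, interleaved `terminate called after throwing an instance of 'c10::CUDAError'` lines mean several processes aborted at once, a pattern often seen when forked `DataLoader` worker processes inherit an already-initialized CUDA context. One thing worth trying (an educated guess, not a confirmed fix for this repo) is forcing the `spawn` start method before training starts:

```python
import multiprocessing as mp

# fork()ed children inherit the parent's CUDA state, which CUDA does not
# support; if a worker then touches the GPU it can die with
# "CUDA error: initialization error". The "spawn" start method launches each
# worker in a fresh interpreter instead. (torch.multiprocessing exposes the
# same API; plain multiprocessing is used here only so the sketch runs
# without torch installed.)
if mp.get_start_method(allow_none=True) != "spawn":
    mp.set_start_method("spawn", force=True)
```

This call belongs at the very top of the training entry point, before any dataloaders or CUDA tensors are created.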