CUDA error: device-side assert triggered at

Open sirius541 opened this issue 3 years ago • 4 comments

When I run train.py in the folder 'siamrpn_alex_dwxcorr_otb', I get this error: 'RuntimeError: cuda runtime error (710) : device-side assert triggered at C:/w/b/windows/pytorch/aten/src\THCUNN/generic/ClassNLLCriterion.cu:115'. Does anyone know how to solve it?

sirius541 avatar Mar 27 '21 01:03 sirius541

Currently it does not support Windows

ZhiyuanChen avatar Apr 27 '21 09:04 ZhiyuanChen

When I run train.py on Ubuntu, I hit this problem too. The error message is as follows:

/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [4,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

... ... ...
Traceback (most recent call last):
  File "tools/train.py", line 346, in <module>
    main()
  File "tools/train.py", line 336, in main
    train(train_loader, dist_model, optimizer, lr_scheduler, tb_writer)
  File "tools/train.py", line 217, in train
    outputs = model(data)
  File "/home/wy/anaconda3/envs/pysot/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/wy/ee9c6cc3-4234-40f0-bc7c-fe7854a554d6/xiaowei/pysot/pysot/utils/distributed.py", line 43, in forward
    return self.module(*args, **kwargs)
  File "/home/wy/anaconda3/envs/pysot/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/wy/ee9c6cc3-4234-40f0-bc7c-fe7854a554d6/xiaowei/pysot/pysot/models/model_builder.py", line 102, in forward
    loc_loss = weight_l1_loss(loc, label_loc, label_loc_weight)
  File "/media/wy/ee9c6cc3-4234-40f0-bc7c-fe7854a554d6/xiaowei/pysot/pysot/models/loss.py", line 34, in weight_l1_loss
    diff = (pred_loc - label_loc).abs()
RuntimeError: The size of tensor a (17) must match the size of tensor b (25) at non-singleton dimension 4
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/torch/lib/THD/base/data_channels/DataChannelNccl.cpp line=195 error=59 : device-side assert triggered
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/wy/anaconda3/envs/pysot/lib/python3.7/site-packages/torch/distributed/__init__.py", line 41, in destroy_process_group
    torch._C._dist_destroy_process_group()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535493744281/work/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:195 at /opt/conda/conda-bld/pytorch_1535493744281/work/torch/lib/THD/process_group/General.cpp:26

If someone can help me, I will be very grateful!

LearnByDoingXW avatar May 09 '21 07:05 LearnByDoingXW
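
The traceback above reports that pred_loc and label_loc differ at dimension 4 (17 vs 25), which suggests the network output and the labels do not share a spatial size. Checking shapes before the loss is computed can surface this with a readable message. Below is a minimal sketch in plain Python; `check_same_shape` is a hypothetical helper, not part of pysot, and the example shapes are made up apart from the 17 and 25 taken from the error message. Note that it demands exact equality, which is stricter than PyTorch broadcasting.

```python
def check_same_shape(name_a, shape_a, name_b, shape_b):
    """Raise ValueError listing every dimension where two shapes differ.

    Hypothetical helper (not part of pysot). Pass tuple(tensor.shape)
    when working with real tensors.
    """
    shape_a, shape_b = tuple(shape_a), tuple(shape_b)
    if len(shape_a) != len(shape_b):
        raise ValueError(
            f"{name_a} has {len(shape_a)} dims, {name_b} has {len(shape_b)} dims"
        )
    mismatches = [
        (dim, a, b)
        for dim, (a, b) in enumerate(zip(shape_a, shape_b))
        if a != b
    ]
    if mismatches:
        details = ", ".join(f"dim {d}: {a} vs {b}" for d, a, b in mismatches)
        raise ValueError(f"{name_a} and {name_b} differ at {details}")


# Illustrative shapes: only the 17 and 25 at dimension 4 come from the error.
pred_shape = (28, 4, 5, 17, 17)
label_shape = (28, 4, 5, 25, 25)
try:
    check_same_shape("pred_loc", pred_shape, "label_loc", label_shape)
    message = ""
except ValueError as err:
    message = str(err)
print(message)
```

If the check fails, the usual next step is to compare the output size the training config assumes with the size the backbone actually produces.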

Same here, I get this error on Ubuntu:

File "/home/xxc/project/GeekPlusA-ai-pysot-master/pysot/pysot/models/loss.py", line 16, in get_cls_loss pred = torch.index_select(pred, 0, select)

RuntimeError: CUDA error: device-side assert triggered

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22772) of binary: /home/xxc/miniconda3/envs/pysot/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group

xuxiangchen avatar Oct 17 '21 07:10 xuxiangchen
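
A device-side assert raised from torch.index_select generally means that some index falls outside the selected dimension. CUDA kernels run asynchronously, so the Python line in the traceback is not always the one that failed. Two common debugging aids are sketched below: setting the CUDA_LAUNCH_BLOCKING environment variable so kernels run synchronously, and checking the indices on the host before the call. `find_bad_indices` is a hypothetical helper, not part of pysot, and the sample values are made up.

```python
import os

# Must be set before the first CUDA call (simplest: before importing torch),
# so that kernels run synchronously and the Python traceback points at the
# call that actually failed.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_bad_indices(indices, dim_size):
    """Return (position, value) pairs for indices outside [0, dim_size)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < dim_size
    ]


# Made-up values: a dimension of size 5 and one index past its end.
select = [0, 3, 7, 4]
bad = find_bad_indices(select, dim_size=5)
print(bad)  # [(2, 7)]
```

With real tensors, one would pass something like `select.tolist()` and `pred.size(0)`. Running the same step on CPU is another way to get a plain Python exception instead of a device-side assert.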

Re: "Currently it does not support Windows"

I can run it on Windows 11, either with CUDA or on CPU, in a conda environment:

GPU case: first check your CUDA version (mine is 11.6, for example):

nvcc --version

Next, I updated my conda environment with:

conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.6 -c pytorch -c conda-forge

The following packages will be UPDATED:
  ca-certificates pkgs/main::ca-certificates-2022.10.11~ --> conda-forge::ca-certificates-2022.12.7-h5b45459_0 None
  certifi pkgs/main/win-64::certifi-2022.9.24-p~ --> conda-forge/noarch::certifi-2022.12.7-pyhd8ed1ab_0 None
  pytorch 0.4.1-py37_cuda90_cudnn7he774522_1 --> 1.12.0-py3.7_cuda11.6_cudnn8_0 None
  torchvision pytorch/noarch::torchvision-0.2.1-py_2 --> pytorch/win-64::torchvision-0.13.0-py37_cu116 None

That's it. The model will run on GPU.

Tested with:

python tools/demo.py --config experiments/siamrpn_mobilev2_l234_dwxcorr/config.yaml --snapshot experiments/siamrpn_mobilev2_l234_dwxcorr/model.pth --video demo/bag.avi
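
To confirm that the updated environment really picked up a CUDA-enabled build, a quick check along these lines may help. This is only a sketch: `torch_cuda_summary` is a hypothetical helper name, and it returns early if PyTorch is not installed at all.

```python
import importlib.util


def torch_cuda_summary():
    """Report the installed PyTorch build and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the build was compiled for; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are actually visible.
        "cuda_available": torch.cuda.is_available(),
    }


summary = torch_cuda_summary()
print(summary)
```

If `cuda_available` is False even though `built_for_cuda` is set, the problem is more likely the driver or the GPU than the conda packages.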

CPU case: if you want to run on CPU only, you do not need to update any packages. Just open demo.py and insert the line cfg.CUDA = False at the top of main(), e.g.:

def main():
    # force CPU-only inference
    cfg.CUDA = False
    # load config
    cfg.merge_from_file(args.config)
    ...
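
Instead of hard-coding the value, the same switch could be exposed on the command line. The sketch below assumes a hypothetical `--cpu` flag that is not part of the original demo.py; `build_parser` and `want_cuda` are illustrative names as well.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="CPU/GPU switch sketch")
    # Hypothetical flag, not present in the original demo.py.
    parser.add_argument("--cpu", action="store_true", help="force CPU inference")
    return parser


def want_cuda(args, cuda_available):
    """Use CUDA only when it is available and --cpu was not passed."""
    return cuda_available and not args.cpu


args = build_parser().parse_args(["--cpu"])
print(want_cuda(args, cuda_available=True))  # False
```

Inside main(), one might then write `cfg.CUDA = want_cuda(args, torch.cuda.is_available())` in place of the hard-coded assignment.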

Robomate avatar Dec 19 '22 13:12 Robomate