STEP CUDA error: an illegal memory access was encountered

Open yushanshan05 opened this issue 4 years ago • 3 comments

Hi, thanks for your great work. I am training on my own dataset, which has ten classes, with fps=1 and without the --fp16 flag. max_iter=2, batch_size=2.

But when I start training, I get an error. It happens during the third iteration. The first and second iterations are fine: the forward pass, the backward pass, and optimizer.step() all succeed. When the third iteration starts, this error is thrown:

Traceback (most recent call last):
  File "train.py", line 602, in <module>
    main()
  File "train.py", line 235, in main
    train(args, nets, optimizer, scheduler, train_dataloader, val_dataloader, log_file)
  File "train.py", line 362, in train
    optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 51, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/adam.py", line 103, in step
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
RuntimeError: CUDA error: an illegal memory access was encountered

Feb 24 '20 08:02 yushanshan05
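
A note on reading the traceback above: CUDA kernels are launched asynchronously, so the Python line that raises (here inside adam.py) is not necessarily where the faulting kernel was launched. One common way to narrow it down is to set the CUDA_LAUNCH_BLOCKING environment variable to 1 before CUDA is initialized, which makes launches synchronous so the error surfaces closer to its origin. Below is a minimal sketch, not taken from this thread; the helper name `enable_sync_cuda_errors` is hypothetical, and the "set it before importing torch" rule is a conservative assumption.

```python
import os
import sys


def enable_sync_cuda_errors():
    """Ask CUDA to run kernel launches synchronously (debugging aid).

    Hypothetical helper: call it at the very top of train.py, before
    `import torch`, so the variable is in place before CUDA initializes.
    Returns True if torch had not been imported yet, False otherwise.
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_imported


if __name__ == "__main__":
    set_early_enough = enable_sync_cuda_errors()
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
    if not set_early_enough:
        print("Warning: torch was already imported; the setting may be ignored.")
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python train.py`. Training runs slower in this mode, so it is only worth enabling while debugging.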

I am facing the same issue while working on the SPADE code.

Traceback (most recent call last):
  File "train.py", line 40, in <module>
    trainer.run_generator_one_step(data_i)
  File "/home/abhay/inpaint-sa/trainers/pix2pix_trainer.py", line 38, in run_generator_one_step
    self.optimizer_G.step()
  File "/home/abhay/miniconda3/envs/pytorch36/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/abhay/miniconda3/envs/pytorch36/lib/python3.6/site-packages/torch/optim/adam.py", line 111, in step
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
RuntimeError: CUDA error: an illegal memory access was encountered

Sep 08 '20 07:09 Avashist1998

Excuse me, did you solve it?

Jan 07 '22 05:01 mathshangw

For me it was a hardware issue. The GPU was getting too hot and crashing, because the fans were not spinning up at higher temperatures.

Jan 08 '22 02:01 Avashist1998
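
If overheating is a suspect, as in the comment above, one way to check is to log the GPU temperature while training runs. Here is a rough sketch that shells out to nvidia-smi; it assumes nvidia-smi is on the PATH, the function names are hypothetical, and the 85 °C limit is only an illustrative threshold, not a value from this thread or a vendor specification.

```python
import subprocess


def parse_gpu_temperatures(smi_output):
    """Parse nvidia-smi output with one temperature (deg C) per line.

    Lines that are not plain integers (for example "N/A") are skipped.
    """
    temps = []
    for line in smi_output.splitlines():
        value = line.strip()
        if value.isdigit():
            temps.append(int(value))
    return temps


def read_gpu_temperatures():
    """Query the current temperature of every GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_temperatures(result.stdout)


def too_hot(temps, limit_c=85):
    """Return True if any GPU is at or above limit_c (assumed threshold)."""
    return any(t >= limit_c for t in temps)


if __name__ == "__main__":
    temps = read_gpu_temperatures()
    print("GPU temperatures (C):", temps)
    if too_hot(temps):
        print("Warning: at least one GPU is at or above the assumed limit.")
```

Calling `read_gpu_temperatures()` every few iterations and writing the result to the training log would show whether the crash lines up with a temperature spike.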
