CUDA error: an illegal memory access was encountered

Open yushanshan05 opened this issue 4 years ago • 3 comments

Hi, thanks for your great work. I am training on my own dataset, which has ten classes, with fps=1, max_iter=2, and batch_size=2. I am not using the --fp16 flag.

But when I start training, I get an error. It happens during the third iteration; the first and second iterations are fine. In those two iterations the forward pass, the backward pass, and optimizer.step() all complete without problems. When the third iteration starts, it throws this error:

Traceback (most recent call last):
  File "train.py", line 602, in <module>
    main()
  File "train.py", line 235, in main
    train(args, nets, optimizer, scheduler, train_dataloader, val_dataloader, log_file)
  File "train.py", line 362, in train
    optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 51, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/adam.py", line 103, in step
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
RuntimeError: CUDA error: an illegal memory access was encountered
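One caveat when reading a traceback like the one above: CUDA kernels launch asynchronously, so the Python line that raises the error (here inside adam.py) is not necessarily the operation that performed the illegal access. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes launches synchronous, so the traceback is more likely to point at the real culprit. Below is a minimal sketch of rerunning a command that way; the helper names `blocking_env` and `rerun_synchronously` are made up for illustration and are not part of PyTorch or of this repository.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # The CUDA runtime reads this variable at startup, so it must be set
    # before the training process starts.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_synchronously(argv):
    """Run a command with synchronous CUDA launches and return its exit code."""
    return subprocess.run(argv, env=blocking_env()).returncode
```

For example, `rerun_synchronously([sys.executable, "train.py"])` would rerun the training script (add whatever flags were used originally). Training is noticeably slower in this mode, so it is best used only to locate the failing operation.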

yushanshan05 avatar Feb 24 '20 08:02 yushanshan05

I am facing the same issue while working on the SPADE code.

Traceback (most recent call last):
  File "train.py", line 40, in <module>
    trainer.run_generator_one_step(data_i)
  File "/home/abhay/inpaint-sa/trainers/pix2pix_trainer.py", line 38, in run_generator_one_step
    self.optimizer_G.step()
  File "/home/abhay/miniconda3/envs/pytorch36/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/abhay/miniconda3/envs/pytorch36/lib/python3.6/site-packages/torch/optim/adam.py", line 111, in step
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
RuntimeError: CUDA error: an illegal memory access was encountered

Avashist1998 avatar Sep 08 '20 07:09 Avashist1998

Excuse me, did you solve it?

mathshangw avatar Jan 07 '22 05:01 mathshangw

For me, it was a hardware issue. The GPU was overheating and crashing because the fans were not kicking in at higher temperatures.

Avashist1998 avatar Jan 08 '22 02:01 Avashist1998
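For anyone who wants to rule out the overheating cause described above, here is a rough sketch that polls nvidia-smi for temperature and fan speed while training runs. The function names and the 85 °C threshold are assumptions chosen for illustration, not values from this thread; check the limits for your own card.

```python
import subprocess

# Query fields supported by nvidia-smi's --query-gpu option.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,fan.speed",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert a numeric field to int; return None for values like [N/A]."""
    return int(field) if field.isdigit() else None


def parse_readings(csv_text):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, temp, fan = [field.strip() for field in line.split(",")]
        readings.append(
            {"gpu": int(index), "temp_c": _to_int(temp), "fan_pct": _to_int(fan)}
        )
    return readings


def hot_gpus(readings, limit_c=85):
    """Return indices of GPUs at or above limit_c (an assumed threshold)."""
    return [
        r["gpu"]
        for r in readings
        if r["temp_c"] is not None and r["temp_c"] >= limit_c
    ]


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_readings(result.stdout)
```

Calling `hot_gpus(read_gpus())` in a loop during training shows whether temperature climbs while the reported fan speed stays low, which would match the failure described in the last comment.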