Hi, thanks for your great work.
I am training on my own dataset, which has ten classes, with fps=1 and without the --fp16 flag.
max_iter=2
batch_size=2
But when I start training, I get an error on the third iteration. The first and second iterations run fine: the forward pass, the backward pass, and optimizer.step() all succeed. When the third iteration starts, this error is thrown:
Traceback (most recent call last):
  File "train.py", line 602, in <module>
    main()
  File "train.py", line 235, in main
    train(args, nets, optimizer, scheduler, train_dataloader, val_dataloader, log_file)
  File "train.py", line 362, in train
    optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 51, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/adam.py", line 103, in step
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
RuntimeError: CUDA error: an illegal memory access was encountered
I am facing the same issue while working with the SPADE code.
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    trainer.run_generator_one_step(data_i)
  File "/home/abhay/inpaint-sa/trainers/pix2pix_trainer.py", line 38, in run_generator_one_step
    self.optimizer_G.step()
  File "/home/abhay/miniconda3/envs/pytorch36/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/abhay/miniconda3/envs/pytorch36/lib/python3.6/site-packages/torch/optim/adam.py", line 111, in step
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
RuntimeError: CUDA error: an illegal memory access was encountered
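Not a confirmed diagnosis for either case, but one thing worth trying: CUDA kernels are launched asynchronously, so the adam.py line in both tracebacks may only be where the error got reported, not where the bad memory access actually happened. Setting CUDA_LAUNCH_BLOCKING=1 makes launches synchronous so the traceback points closer to the real culprit. A common cause of this kind of error on a custom dataset is a class id outside the valid range, so below is a rough sketch of a label check as well. The helper find_bad_labels and its ignore_index parameter are made-up names for illustration, not part of this repo or of PyTorch.

```python
import os

# Must be set before CUDA is initialised (simplest: before importing torch).
# Alternatively, run the script as: CUDA_LAUNCH_BLOCKING=1 python train.py
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_bad_labels(labels, num_classes, ignore_index=None):
    """Return the labels that fall outside [0, num_classes).

    `labels` is any iterable of ints, e.g. target.flatten().tolist().
    `ignore_index` is a hypothetical sentinel value that your loss may skip.
    """
    bad = []
    for value in labels:
        if ignore_index is not None and value == ignore_index:
            continue
        if not 0 <= value < num_classes:
            bad.append(value)
    return bad


if __name__ == "__main__":
    # With ten classes the valid ids are 0..9, so 10 and -1 are out of range.
    print(find_bad_labels([0, 3, 9, 10, -1], num_classes=10))
```

If this returns anything for a batch, the labels (or the number of classes in the config) are the first thing I would look at. If it always comes back empty, the problem is somewhere else.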
Excuse me, did you solve it?
For me it was a hardware issue. The GPU was overheating and crashing because the fans were not kicking in at higher temperatures.
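If anyone wants to rule out the same overheating problem, here is a minimal sketch that reads temperature and fan speed through nvidia-smi while training runs in another terminal. It assumes nvidia-smi is on the PATH. The function names parse_gpu_stats and read_gpu_stats are my own, and what counts as "too hot" depends on the card, so no threshold is hard-coded.

```python
import subprocess

# Query fields supported by `nvidia-smi --query-gpu`.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,fan.speed",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert a CSV field to int, or None for values like '[N/A]'."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_stats(text):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, temp, fan = line.split(",")
        rows.append({
            "index": _to_int(index),
            "temp_c": _to_int(temp),
            "fan_pct": _to_int(fan),  # None on cards without a fan sensor
        })
    return rows


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed readings."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_stats(output)


if __name__ == "__main__":
    try:
        for gpu in read_gpu_stats():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("could not query nvidia-smi:", exc)
```

A temperature that keeps climbing while fan_pct stays low or flat right before the crash would match what I saw.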