
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at ...

Open Louagyd opened this issue 3 years ago • 2 comments

Hello,

I'm running the training on a directory containing ~1000 images. It seems to start very well, but after some iterations it fails with a CUDA runtime error (illegal memory access). I have searched for this issue but couldn't find any solution.

I have tried the same training with two other datasets, containing ~6000 and ~3000 images respectively, and it worked very well (I didn't have this problem).

ali@marlene:~/Bureau/Velours/python/mtr/Gan2$ stylegan2_pytorch --data /data/data_root/GANDatasets/cp_nivea --aug-prob 0.3 --aug-types [translation,cutout,color] --network-capacity 16 --transparent --batch-size 3 --gradient-accumulate-every 5 --name cp_nivea2 --save_every 5000
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|                                                                                             | 0/150000 [00:00<?, ?it/s]G: 595.52 | D: 6.40 | GP: 4.02
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|                                                                                 | 47/150000 [00:42<40:41:36,  1.02it/s]G: 94.29 | D: 6.40 | GP: 0.54
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|                                                                                 | 87/150000 [01:14<36:15:49,  1.15it/s]G: 5.67 | D: 0.57 | GP: 1.34
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|                                                                                | 143/150000 [02:00<33:31:54,  1.24it/s]G: 1.12 | D: 0.61 | GP: 1.30
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|                                                                                | 198/150000 [02:44<33:36:18,  1.24it/s]G: 223.65 | D: 1.38 | GP: 4.07
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▏                                                                               | 239/150000 [03:18<33:30:04,  1.24it/s]G: 69.72 | D: 1.21 | GP: 0.45
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▏                                                                               | 293/150000 [04:02<33:48:05,  1.23it/s]G: 1.91 | D: 1.05 | GP: 10.99
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▏                                                                              | 345/150000 [05:17<422:25:14, 10.16s/it]G: 2.44 | D: 1.83 | GP: 1.97
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▏                                                                              | 387/150000 [06:00<165:44:04,  3.99s/it]G: 4.07 | D: 0.61 | GP: 0.53
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▏                                                                               | 443/150000 [06:35<65:17:49,  1.57s/it]G: 0.92 | D: 1.93 | GP: 1.76
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▎                                                                               | 499/150000 [07:19<40:00:42,  1.04it/s]G: 2.93 | D: 0.86 | GP: 0.37
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▎                                                                               | 537/150000 [07:52<37:27:16,  1.11it/s]G: 0.58 | D: 0.61 | GP: 1.66
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▎                                                                               | 593/150000 [08:37<34:09:21,  1.22it/s]G: 0.43 | D: 0.33 | GP: 0.86
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▎                                                                               | 644/150000 [09:20<33:56:20,  1.22it/s]G: 2.45 | D: 0.51 | GP: 0.87
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▎                                                                               | 685/150000 [09:54<34:23:30,  1.21it/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=700 : an illegal memory access was encountered
cp_nivea2</data/data_root/GANDatasets/cp_nivea>:   0%|▎                                                                               | 688/150000 [09:57<36:00:28,  1.15it/s]
Traceback (most recent call last):
  File "/home/ali/.local/bin/stylegan2_pytorch", line 8, in <module>
    sys.exit(main())
  File "/home/ali/.local/lib/python3.8/site-packages/stylegan2_pytorch/cli.py", line 179, in main
    fire.Fire(train_from_folder)
  File "/home/ali/.local/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ali/.local/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ali/.local/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ali/.local/lib/python3.8/site-packages/stylegan2_pytorch/cli.py", line 170, in train_from_folder
    run_training(0, 1, model_args, data, load_from, new, num_train_steps, name, seed)
  File "/home/ali/.local/lib/python3.8/site-packages/stylegan2_pytorch/cli.py", line 59, in run_training
    retry_call(model.train, tries=3, exceptions=NanException)
  File "/home/ali/.local/lib/python3.8/site-packages/retry/api.py", line 101, in retry_call
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter, logger)
  File "/home/ali/.local/lib/python3.8/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/home/ali/.local/lib/python3.8/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 964, in train
    backwards(disc_loss, self.GAN.D_opt, loss_id = 1)
  File "/home/ali/.local/lib/python3.8/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 183, in loss_backwards
    loss.backward(**kwargs)
  File "/home/ali/.local/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ali/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:29

UPDATE EDIT: I found out that the cause is another program that uses GPU memory; this error happens every time that program starts. I don't know the actual reason behind it, but when I run export CUDA_LAUNCH_BLOCKING=1 in the terminal before the training command, the error no longer appears.
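For reference, a minimal sketch of that workaround, reusing the exact command from the log above (dataset path and flags are the ones from this report, not a recommendation):

```bash
# Workaround sketch: CUDA_LAUNCH_BLOCKING makes kernel launches synchronous, so errors are
# reported at the call that actually failed (and, per the update above, the crash stopped).
export CUDA_LAUNCH_BLOCKING=1
stylegan2_pytorch --data /data/data_root/GANDatasets/cp_nivea \
    --aug-prob 0.3 --aug-types [translation,cutout,color] \
    --network-capacity 16 --transparent \
    --batch-size 3 --gradient-accumulate-every 5 \
    --name cp_nivea2 --save_every 5000
```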

Louagyd · Dec 13 '20 12:12

I have the same problem, and export CUDA_LAUNCH_BLOCKING=1 didn't resolve the issue for me. Does anyone else have this problem? Or is there anywhere to look for a starting point to debug this?
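As a hedged starting point (not a confirmed fix): given the original poster's observation that the crash coincided with another process grabbing GPU memory, one way to narrow it down might look roughly like this (the dataset path is a placeholder; the flags are the ones used earlier in this issue):

```bash
# Debugging sketch (assumptions: single-GPU machine, nvidia-smi available).
export CUDA_LAUNCH_BLOCKING=1   # synchronous launches, so the reported op is the one that failed
nvidia-smi                      # check whether another process is competing for GPU memory
# Re-run with the smallest configuration that still reproduces the crash:
stylegan2_pytorch --data /path/to/your/dataset --batch-size 1 --gradient-accumulate-every 1 --name debug_run
```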

krips89 · Jul 01 '21 12:07

I have the same problem, and export CUDA_LAUNCH_BLOCKING=1 didn't resolve the issue for me. Does anyone else have this problem? Or is there anywhere to look for a starting point to debug this?

I suggest you use the official NVIDIA release instead (it also has better performance): https://github.com/NVlabs/stylegan2-ada-pytorch
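For example, a rough sketch of that repo's workflow; the flags below are taken from the NVlabs README as I recall them, so double-check there before running, and the paths are placeholders:

```bash
# Sketch of training with the official stylegan2-ada-pytorch repo (paths are placeholders).
git clone https://github.com/NVlabs/stylegan2-ada-pytorch
cd stylegan2-ada-pytorch
# Pack the raw image folder into the repo's dataset format.
python dataset_tool.py --source=/data/data_root/GANDatasets/cp_nivea --dest=./datasets/cp_nivea.zip
# Train on a single GPU.
python train.py --outdir=./training-runs --data=./datasets/cp_nivea.zip --gpus=1
```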

afotonower · Jul 01 '21 13:07