
RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Open kaelyavel opened this issue 1 year ago • 0 comments

Hello,

I ran into this issue today while trying to train the Inpainting model and the Joint Model on Google Colab with a GPU.

I was able to train the Edge model "successfully" (I can't yet check whether the training produces correct values) thanks to #188, which included a working fix for the issue I had encountered. However, with that fix applied, training the Inpainting (Model=2), Joint (Model=4) and Inpainting-Joint (Model=3) models produces the error below. Without the fix (with the vanilla models.py), I get the original error from #188 again.

Training epoch: 1
144/168 [================>...] - ETA: 4s - epoch: 1 - iter: 18 - psnr: 31.0681 - mae: 0.0096
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
.... [SAME LINE] ....

/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [28,0,0], thread: [30,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [28,0,0], thread: [31,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    main(mode=1)
  File "/content/edge-connect/main.py", line 56, in main
    model.train()
  File "/content/edge-connect/src/edge_connect.py", line 145, in train
    outputs, gen_loss, dis_loss, logs = self.inpaint_model.process(images, outputs.detach(), masks)
  File "/content/edge-connect/src/models.py", line 239, in process
    ("l_d2", dis_loss.item()),
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
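
For anyone trying to narrow this down: the assertion `input_val >= zero && input_val <= one` matches the input range check in PyTorch's binary cross-entropy CUDA kernel, so some value reaching that loss is probably outside [0, 1] or is NaN. Below is a minimal sketch of two debugging aids, not a fix. The helper names are hypothetical and not part of edge-connect; the first applies the `CUDA_LAUNCH_BLOCKING=1` hint from the error message, and the second assumes the suspect tensor has been flattened to plain Python floats.

```python
import math
import os


def enable_blocking_launches():
    """Make CUDA kernel launches synchronous so the traceback points at the failing call.

    Must run before the process initializes CUDA, in practice before the
    first `import torch` that touches the GPU.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def summarize_out_of_range(values, low=0.0, high=1.0):
    """Count entries that a [low, high] range assertion would reject.

    `values` is any iterable of Python floats, for example the result of
    `tensor.detach().cpu().flatten().tolist()` (assumed usage, not taken
    from the repository). NaN is counted separately: `nan >= 0` is False,
    so NaN also trips the device-side assert.
    """
    nan_count = below = above = 0
    for v in values:
        if math.isnan(v):
            nan_count += 1
        elif v < low:
            below += 1
        elif v > high:
            above += 1
    return {"nan": nan_count, "below": below, "above": above}


if __name__ == "__main__":
    enable_blocking_launches()
    # Hypothetical values standing in for the input to the loss.
    sample = [0.2, 1.3, float("nan"), -0.1, 0.9]
    print(summarize_out_of_range(sample))
```

Since training is started through `train.py`, the same environment variable can instead be set on the command line (`CUDA_LAUNCH_BLOCKING=1 python train.py`). Running the range check on the values passed to the loss just before the failing call should show whether they are out of range or NaN.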

kaelyavel, May 05 '23 13:05