
Error training in DistributedDataParallel mode (main_train_psnr.py)

Open · bharadwajakshay opened this issue 2 years ago • 1 comment

Hey, I am trying to train the network on remote sensing data for super-resolution. I can train it with the DataParallel method, but whenever I try to train with DistributedDataParallel, training crashes after saving the model and testing one image. Here is what I get (a minimal sketch of the DDP setup involved follows the log):

22-02-09 23:55:15.648 : <epoch: 12, iter:  14,800, lr:2.000e-04> G_loss: 1.523e-02 
22-02-09 23:59:12.943 : <epoch: 12, iter:  15,000, lr:2.000e-04> G_loss: 1.522e-02 
22-02-09 23:59:12.943 : Saving the model.
22-02-09 23:59:22.671 : ---1--> 172312_satx2.png | 32.96dB
Traceback (most recent call last):
  File "/home/akshay/pytorchEnv/KAIR/main_train_psnr.py", line 248, in <module>
    main()
  File "/home/akshay/pytorchEnv/KAIR/main_train_psnr.py", line 185, in main
    model.optimize_parameters(current_step)
  File "/home/akshay/pytorchEnv/KAIR/models/model_plain.py", line 162, in optimize_parameters
    self.G_optimizer.step()
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/optim/adam.py", line 133, in step
    F.adam(params_with_grad,
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/optim/_functional.py", line 86, in adam
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at ../aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f8f2bb17d62 in /home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f8f2f24ea1a in /home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f8f2f250b30 in /home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11c (0x7f8f2f2515fc in /home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6de4 (0x7f8f89b56de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9609 (0x7f8f90ed1609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8f90c9e293 in /lib/x86_64-linux-gnu/libc.so.6)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 366930 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 366931) of binary: /home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/bin/python
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/akshay/.local/share/virtualenvs/pytorchEnv-U_lD4RYp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
main_train_psnr.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-02-09_23:59:28
  host      : lin5
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 366931)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 366931
=======================================================

I'm not sure what's happening; any help is greatly appreciated. Could this be related to the test data?

Python Version: 3.9.9
PyTorch Version: 1.10.2+cu102
CUDA Version: Cuda compilation tools, release 11.5, V11.5.119

EDIT 1 (updated the number of GPUs used): Number of GPUs used: 2
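
For context, here is a minimal sketch of the per-process setup that a DistributedDataParallel run of main_train_psnr.py relies on when launched with one process per GPU. The network below is a stand-in, and the environment-variable handling is the usual launcher convention, not KAIR's exact implementation.

# Minimal DDP sketch (illustrative, not KAIR's actual code). Assumes one process
# per GPU, started by torch.distributed.launch / torchrun, which set
# RANK / LOCAL_RANK / WORLD_SIZE for each process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)            # pin this process to its own GPU
dist.init_process_group(backend="nccl")      # NCCL backend, as in the traceback

net = torch.nn.Conv2d(3, 3, 3, padding=1).cuda()  # stand-in for the SR network
net = DDP(net, device_ids=[local_rank])           # gradients all-reduced across ranks

# All ranks must reach every forward/backward/optimizer step together; if one
# rank stalls (for example, around a save/test step), the collective work on
# the other ranks can eventually fail, which is one common way DDP runs crash
# right after saving/testing.

For two GPUs the launch would be something like python -m torch.distributed.launch --nproc_per_node=2 main_train_psnr.py --opt <your_option_file> --dist True (flags as shown in KAIR's README, if I remember correctly; the option file is whatever you are using).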

bharadwajakshay · Feb 10 '22 07:02

Did you find a solution to your problem? Does the following link help troubleshoot your issue? https://discuss.pytorch.org/t/cuda-error-device-side-assert-triggered-cuda-kernel-errors-might-be-asynchronously-reported-at-some-other-api-call-so-the-stacktrace-below-might-be-incorrect-for-debugging-consider-passing-cuda-launch-blocking-1/160825
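
In case it helps, the traceback itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the Python stack trace points at the op that actually failed. A minimal way to do that from Python (exporting the variable in the shell before launching works just as well):

# Sketch: enable synchronous CUDA launches for debugging. The variable must be
# set before CUDA initializes, i.e. before the first CUDA call; setting it
# before importing torch is the safest place.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the environment variable is set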

BeaverInGreenland · Jun 28 '23 08:06