
Model can train from checkpoint but cannot continue training successively

Open Samleo8 opened this issue 4 years ago • 2 comments

I have tried training the volumetric model on the CMU dataset, but I am running into further problems with training. The model can successfully train for one epoch when started from the previous epoch's checkpoint, but it cannot continue training once that first epoch has finished.

The main error is RuntimeError: NCCL communicator was aborted.


In case this is useful, the full error stack trace is below:

  File "train.py", line 770, in <module>
    main(args)
  File "train.py", line 727, in main
    n_iters_total_train = one_epoch(model, criterion, opt, config, train_dataloader, device, epoch, n_iters_total=n_iters_total_train, is_train=True, master=master, experiment_dir=experiment_dir, writer=writer)
  File "train.py", line 398, in one_epoch
    total_loss.backward()
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: NCCL communicator was aborted.
Traceback (most recent call last):
  File "/home/scleong/.pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
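
The traceback only shows where the abort surfaced (inside backward()), not why the communicator was aborted. NCCL's own logging usually gives more context. A minimal sketch of switching it on from Python, assuming it runs at the very top of the training script before the process group is initialised; the helper name enable_nccl_debug is hypothetical and not part of train.py:

```python
import os


def enable_nccl_debug(level="INFO"):
    """Ask NCCL to log at `level` and return the value now in effect.

    Assumption: this runs before torch.distributed.init_process_group(...),
    since NCCL picks the variable up when it initialises.
    """
    # setdefault keeps any value that was already exported in the shell.
    return os.environ.setdefault("NCCL_DEBUG", level)


# Call once per process, before any distributed setup.
enable_nccl_debug()
```

Exporting NCCL_DEBUG=INFO in the shell before launching has the same effect, and it avoids touching the script.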

Samleo8 Jun 03 '20 13:06

It's some problem with multi-gpu training.

karfly Jun 03 '20 18:06

> It's some problem with multi-gpu training.

Oh no, is there a workaround?

Also, in case it helps, here is the NCCL debug info:

bigfoot:8514:8514 [0] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8514:8514 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8514:8514 [0] NCCL INFO NET/IB : No device found.
bigfoot:8514:8514 [0] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
NCCL version 2.4.8+cuda10.1
bigfoot:8514:8559 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
Successfully loaded pretrained weights for whole model
Optimising model...
Loading data...
Successfully loaded pretrained weights for whole model
Optimising model...
Loading data...
bigfoot:8516:8516 [2] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8516:8516 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8516:8516 [2] NCCL INFO NET/IB : No device found.
bigfoot:8516:8516 [2] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
bigfoot:8516:8560 [2] NCCL INFO Setting affinity for GPU 2 to 0fff
bigfoot:8515:8515 [1] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8515:8515 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8515:8515 [1] NCCL INFO NET/IB : No device found.
bigfoot:8515:8515 [1] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
bigfoot:8515:8561 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
bigfoot:8514:8559 [0] NCCL INFO Channel 00 :    0   1   2
bigfoot:8515:8561 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
bigfoot:8516:8560 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via direct shared memory
bigfoot:8514:8559 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
bigfoot:8514:8559 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
bigfoot:8515:8561 [1] NCCL INFO comm 0x7ff038001b40 rank 1 nranks 3 cudaDev 1 nvmlDev 1 - Init COMPLETE
bigfoot:8514:8559 [0] NCCL INFO comm 0x7fdf30001b40 rank 0 nranks 3 cudaDev 0 nvmlDev 0 - Init COMPLETE
bigfoot:8514:8514 [0] NCCL INFO Launch mode Parallel
bigfoot:8516:8560 [2] NCCL INFO comm 0x7f91ec001b40 rank 2 nranks 3 cudaDev 2 nvmlDev 2 - Init COMPLETE
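
In the log above, each rank prints an "Init COMPLETE" line along with the total number of ranks, which suggests the communicator was set up correctly and the abort happens later. A small sketch for checking this on longer logs, assuming the line format shown above; the function name missing_ranks is hypothetical:

```python
import re

# Matches lines such as:
# bigfoot:8515:8561 [1] NCCL INFO comm 0x... rank 1 nranks 3 cudaDev 1 nvmlDev 1 - Init COMPLETE
INIT_RE = re.compile(r"NCCL INFO comm \S+ rank (\d+) nranks (\d+) .*Init COMPLETE")


def missing_ranks(log_text):
    """Return the sorted ranks that never logged 'Init COMPLETE'."""
    seen = set()
    expected = 0
    for line in log_text.splitlines():
        match = INIT_RE.search(line)
        if match:
            seen.add(int(match.group(1)))
            # nranks is repeated on every line; keep the largest value seen.
            expected = max(expected, int(match.group(2)))
    return sorted(set(range(expected)) - seen)
```

For the log above it returns an empty list, since ranks 0, 1 and 2 all report Init COMPLETE with nranks 3.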

Samleo8 Jun 04 '20 03:06