learnable-triangulation-pytorch
Model can train from checkpoint but cannot continue training successively
I have been trying to train the volumetric model on the CMU dataset, but I am encountering further problems with training. The model can successfully train one epoch when started from the checkpoint of the previous epoch, but it is unable to continue training past that first epoch within the same run (i.e. when resuming from the checkpoint, it crashes as the next epoch begins).
The main error is: RuntimeError: NCCL communicator was aborted.
In case this is useful, the full error stack trace is below:
File "train.py", line 770, in <module>
main(args)
File "train.py", line 727, in main
n_iters_total_train = one_epoch(model, criterion, opt, config, train_dataloader, device, epoch, n_iters_total=n_iters_total_train, is_train=True, master=master, experiment_dir=experiment_dir, writer=writer)
File "train.py", line 398, in one_epoch
total_loss.backward()
File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: NCCL communicator was aborted.
Traceback (most recent call last):
File "/home/scleong/.pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 193, n _run_module_as_main
"__main__", mod_spec)
It's some problem with multi-gpu training.
Oh no, is there a workaround?
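In case it helps anyone hitting the same error, below is a minimal sketch of two mitigations that are sometimes tried for "NCCL communicator was aborted" during multi-GPU DistributedDataParallel training. These are assumptions on my part, not confirmed fixes for this repo's train.py, and the init_process_group call assumes the usual torch.distributed.launch / env:// setup rather than whatever train.py actually does.

# Hedged sketch only: generic mitigations for NCCL communicator aborts,
# not a confirmed fix for this repository's training script.
import datetime
import os

import torch.distributed as dist

# 1) Ask NCCL collectives to fail with an error instead of aborting/hanging
#    silently, and give the process group a longer timeout than the default.
#    NCCL_BLOCKING_WAIT must be set before the process group is created.
os.environ["NCCL_BLOCKING_WAIT"] = "1"
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=1))

# 2) As a diagnostic, rule out NCCL entirely by running on a single GPU, e.g.:
#    CUDA_VISIBLE_DEVICES=0 python train.py --config <config>.yaml

If the single-GPU run trains past the first epoch, that would at least confirm the problem is specific to the multi-GPU/NCCL path.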
Also, in case it helps, here is the NCCL debug info:
bigfoot:8514:8514 [0] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8514:8514 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8514:8514 [0] NCCL INFO NET/IB : No device found.
bigfoot:8514:8514 [0] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
NCCL version 2.4.8+cuda10.1
bigfoot:8514:8559 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
Successfully loaded pretrained weights for whole model
Optimising model...
Loading data...
Successfully loaded pretrained weights for whole model
Optimising model...
Loading data...
bigfoot:8516:8516 [2] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8516:8516 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8516:8516 [2] NCCL INFO NET/IB : No device found.
bigfoot:8516:8516 [2] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
bigfoot:8516:8560 [2] NCCL INFO Setting affinity for GPU 2 to 0fff
bigfoot:8515:8515 [1] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8515:8515 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8515:8515 [1] NCCL INFO NET/IB : No device found.
bigfoot:8515:8515 [1] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
bigfoot:8515:8561 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
bigfoot:8514:8559 [0] NCCL INFO Channel 00 : 0 1 2
bigfoot:8515:8561 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
bigfoot:8516:8560 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via direct shared memory
bigfoot:8514:8559 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
bigfoot:8514:8559 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
bigfoot:8515:8561 [1] NCCL INFO comm 0x7ff038001b40 rank 1 nranks 3 cudaDev 1 nvmlDev 1 - Init COMPLETE
bigfoot:8514:8559 [0] NCCL INFO comm 0x7fdf30001b40 rank 0 nranks 3 cudaDev 0 nvmlDev 0 - Init COMPLETE
bigfoot:8514:8514 [0] NCCL INFO Launch mode Parallel
bigfoot:8516:8560 [2] NCCL INFO comm 0x7f91ec001b40 rank 2 nranks 3 cudaDev 2 nvmlDev 2 - Init COMPLETE