
Possible bottleneck?

Open Vadim2S opened this issue 4 years ago • 3 comments

I got this warning:

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: Dataloader(num_workers>0) and ddp_spawn do not mix well! Your performance might suffer dramatically. Please consider setting distributed_backend=ddp to use num_workers > 0 (this is a bottleneck of Python .spawn() and PyTorch

Is this OK?
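For context, here is a minimal sketch (a hypothetical toy module, not assem-vc code, assuming PyTorch Lightning ~1.0.x defaults) of the combination the warning refers to: a DataLoader with num_workers > 0 while the Trainer falls back to the ddp_spawn backend for multi-GPU training.

    # Minimal sketch (hypothetical module, not assem-vc code) of the setup that
    # triggers the warning in PyTorch Lightning ~1.0.x: num_workers > 0 in the
    # DataLoader while the Trainer defaults to the 'ddp_spawn' backend.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class ToyModule(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())

        def train_dataloader(self):
            ds = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
            return DataLoader(ds, batch_size=8, num_workers=4)  # num_workers > 0

    if __name__ == "__main__":
        # With gpus=2 and no explicit backend, this PL version picks 'ddp_spawn',
        # which spawns fresh processes and makes DataLoader worker start-up
        # expensive, hence the warning above.
        trainer = pl.Trainer(gpus=2, max_epochs=1)
        trainer.fit(ToyModule())

As the warning text itself says, this is only about performance, which is what the answers below address.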

Vadim2S avatar Jul 12 '21 07:07 Vadim2S

Hi. In my case, I tried using distributed_backend='ddp', as that warning recommends. However, a multi-GPU training error occurs in the following situations:

  • when the first GPU (i.e. ID 0) is not included in the GPU list. For example: python synthesizer_trainer.py -g 1,2,3
  • when the GPU list is not consecutive. For example: python synthesizer_trainer.py -g 0,2,3

About the issue mentioned above, see https://github.com/PyTorchLightning/pytorch-lightning/issues/4171
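For reference, here is a sketch of the Trainer configurations those two -g examples correspond to, assuming -g is forwarded to the Trainer's gpus argument (the exact wiring in the trainer scripts may differ):

    # Illustrative only; mirrors the two failing -g examples above.
    import pytorch_lightning as pl

    # -g 1,2,3: GPU 0 is missing from the list -> multi-GPU training error
    # with 'ddp' in pytorch-lightning 1.0.x (see the issue linked above).
    trainer_a = pl.Trainer(gpus=[1, 2, 3], accelerator='ddp')

    # -g 0,2,3: the list is not consecutive -> same error.
    trainer_b = pl.Trainer(gpus=[0, 2, 3], accelerator='ddp')

    # A consecutive list starting at GPU 0 trains normally.
    trainer_ok = pl.Trainer(gpus=[0, 1, 2], accelerator='ddp')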

This error is caused by pytorch-lightning itself and can be resolved by upgrading its version.

As the warning says, using DDP together with num_workers>0 makes initialization and training faster. If you want that speed-up with the current code, do the following (a minimal sketch follows the list):

  1. Change accelerator=None to accelerator='ddp' in synthesizer_trainer.py and cotatron_trainer.py.
  2. After that, if you want to use GPUs 1, 2, and 4, launch with CUDA_VISIBLE_DEVICES=1,2,4 python3 synthesizer_trainer.py instead of passing the -g 1,2,4 option.
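Putting the two steps together, a minimal sketch (argument values here are illustrative, not copied from the trainer scripts):

    import pytorch_lightning as pl

    # Step 1: request the 'ddp' backend instead of the default, so that
    # DataLoader(num_workers > 0) no longer conflicts with process spawning.
    trainer = pl.Trainer(
        accelerator='ddp',  # was accelerator=None
        gpus=-1,            # use every GPU visible to this process (assumption)
    )

    # Step 2: select GPUs via the environment instead of -g, e.g.
    #   CUDA_VISIBLE_DEVICES=1,2,4 python3 synthesizer_trainer.py
    # CUDA renumbers the visible devices as 0, 1, 2, so the Trainer always sees
    # a consecutive list starting at GPU 0 and the linked issue is avoided.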

wookladin avatar Jul 12 '21 09:07 wookladin

In order to solve this problem completely, we need to upgrade the PyTorch Lightning dependency. However, there are conflicts between PL versions, so we plan to check them carefully. Thank you for sharing the issue!

wookladin avatar Jul 12 '21 09:07 wookladin

Unfortunately, accelerator='ddp' is not stable for me; accelerator=None works fine. Here is the traceback:

File "/home/assem-vc/synthesizer_trainer.py", line 85, in main(args) File "/home/assem-vc/synthesizer_trainer.py", line 64, in main trainer.fit(model) File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit results = self.accelerator_backend.train() File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train results = self.ddp_train(process_idx=self.task_idx, model=model) File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 279, in ddp_train results = self.train_or_test() File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test results = self.trainer.train() File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 482, in train self.train_loop.run_training_epoch() File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx) File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch self.accumulated_loss.append(opt_closure_result.loss) File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py", line 64, in append x = x.to(self.memory) RuntimeError: CUDA error: the launch timed out and was terminated Exception in thread Thread-22: Traceback (most recent call last): File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/usr/lib/python3.8/threading.py", line 870, in run self._target(*self._args, *self._kwargs) File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get terminate called after throwing an instance of 'c10::Error' what(): CUDA error: the launch timed out and was terminated (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3f13358193 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so) frame #1: + 0x17f66 (0x7f3f13595f66 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so) frame #2: + 0x19cbd (0x7f3f13597cbd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so) frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f3f1334863d in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so) frame #4: c10d::Reducer::~Reducer() + 0x449 (0x7f3eff7e9b89 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so) frame #5: std::_Sp_counted_ptr<c10d::Reducer, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f3eff7cb592 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so) frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f3eff034e56 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so) frame #7: + 0x9e813b (0x7f3eff7cc13b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so) frame #8: + 0x293f30 (0x7f3eff077f30 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so) frame #9: + 0x2951ce (0x7f3eff0791ce in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so) frame #10: /usr/bin/python3() [0x5d1ca7] frame #11: /usr/bin/python3() [0x5a605d] frame #12: 
/usr/bin/python3() [0x5d1ca7] frame #13: /usr/bin/python3() [0x5a3132] frame #14: /usr/bin/python3() [0x4ef828] frame #15: _PyGC_CollectNoFail + 0x2f (0x6715cf in /usr/bin/python3) frame #16: PyImport_Cleanup + 0x244 (0x683bf4 in /usr/bin/python3) frame #17: Py_FinalizeEx + 0x7f (0x67eaef in /usr/bin/python3) frame #18: Py_RunMain + 0x32d (0x6b624d in /usr/bin/python3) frame #19: Py_BytesMain + 0x2d (0x6b64bd in /usr/bin/python3) frame #20: __libc_start_main + 0xf3 (0x7f3f1f2e30b3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #21: _start + 0x2e (0x5f927e in /usr/bin/python3)

Vadim2S avatar Jul 26 '21 10:07 Vadim2S