
DataLoader memory problem

Open duskvirkus opened this issue 4 years ago • 1 comment

Occurred while training on a V100 Colab session, most likely with the standard memory configuration.

Low-priority bug, but still worth noting.

Epoch 17:  81% 6080/7499 [54:22<12:41,  1.86it/s, kimgs=1627.126, r_t_stat=0.750, ada_aug_p=0.484256]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "scripts/trainer.py", line 178, in <module>
  File "scripts/trainer.py", line 175, in cli_main
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 209, in _run_optimization
    self._update_running_loss(result.loss)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 603, in _update_running_loss
    self.accumulated_loss.append(current_loss)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/supporters.py", line 82, in append
    x = x.to(self.memory)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 873) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
Epoch 17:  81%|████████  | 6080/7499 [54:31<12:43,  1.86it/s, kimgs=1627.126, r_t_stat=0.750, ada_aug_p=0.484256]
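
The error points at the DataLoader worker processes exhausting shared memory (/dev/shm), not GPU memory. A minimal sketch of one possible workaround, assuming the dataset and batch size are available wherever the DataLoader is built (the `num_workers` value here is illustrative, not the repo's actual setting):

```python
# Hypothetical mitigation sketch, not code from this repo: lowering num_workers
# (or dropping to 0) keeps sample tensors out of /dev/shm, at the cost of
# slower data loading on a standard-memory Colab runtime.
from torch.utils.data import DataLoader

def make_dataloader(dataset, batch_size):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=0,   # 0 = load in the main process; no shared-memory hand-off between worker processes
        pin_memory=True,
    )
```

The other route the error message suggests, raising the shared-memory limit (e.g. `--shm-size` when running under Docker), is generally not adjustable on a standard Colab session.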

duskvirkus · Sep 02 '21 02:09

Occurred on v1.1.0 of the repo.

duskvirkus · Sep 02 '21 02:09