deep-rep
Training crashed
Hello,
I am trying to re-train your model. I was able to pre-train it on the SyntheticSR dataset for 500k iterations and everything works fine. But when I switch to fine-tuning the model on the BurstSR dataset, the training crashes.
Here is the error log:
.......
[train: 1, 1000 / 1000] FPS: 2.4 (10.7) , Loss/total: 0.03915 , Loss/rgb: 0.03915 , Loss/raw/rgb: 0.00391 , Stat/psnr: 46.35584
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
File "/cluster/.../deep-rep/trainers/base_trainer.py", line 69, in train
self.train_epoch()
File "/cluster/.../deep-rep/trainers/simple_trainer.py", line 95, in train_epoch
self.cycle_dataset(loader)
File "/cluster/.../deep-rep/trainers/simple_trainer.py", line 66, in cycle_dataset
for i, data in enumerate(loader, 1):
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 1057, in _next_data
self._shutdown_workers()
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 1177, in _shutdown_workers
w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/popen_fork.py", line 44, in wait
if not wait([self.sentinel], timeout):
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 83465) is killed by signal: Terminated.
Restarting training from last epoch ...
....
It seems to be related to the num_workers setting in the DataLoader. Your default is settings.num_workers = 8; since I run the code on a single GPU, I reduced num_workers to 4 accordingly. The error occurs whenever num_workers is larger than 0, but setting it to 0 makes training too slow. I am confused by this behavior, since everything works fine on the SyntheticSR data.
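For reference, this is roughly how the worker count enters the picture on my side; the dataset below is just a dummy placeholder rather than the actual deep-rep BurstSR dataset class, and the variable names are mine:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy burst/target tensors standing in for the BurstSR samples;
# the real datasets come from the deep-rep dataset classes.
dummy_train = TensorDataset(torch.randn(64, 8, 4, 48, 48),
                            torch.randn(64, 3, 96, 96))

num_workers = 4  # reduced from the default settings.num_workers = 8 for my single-GPU run

train_loader = DataLoader(dummy_train,
                          batch_size=4,
                          shuffle=True,
                          num_workers=num_workers,  # any value > 0 reproduces the crash for me
                          pin_memory=True)

# The worker processes are spawned (and, in my case, eventually killed)
# while iterating the loader inside cycle_dataset().
for burst, gt in train_loader:
    pass
```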
Do you have any idea what might cause this problem? Thank you very much!
Best regards, Shijian
Update: the problem is related to the validation loader, but I am still not sure what is happening.
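In case it helps narrow things down, my current workaround is to keep workers for the training loader and disable them only for the validation loader. A minimal sketch of that idea with placeholder datasets (the real loaders are of course built elsewhere in the deep-rep training settings):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets; in the real run these are the deep-rep BurstSR
# train/val dataset objects.
train_set = TensorDataset(torch.randn(32, 4, 48, 48), torch.randn(32, 3, 96, 96))
val_set = TensorDataset(torch.randn(8, 4, 48, 48), torch.randn(8, 3, 96, 96))

# Training loader keeps multiple workers for throughput.
train_loader = DataLoader(train_set, batch_size=4, shuffle=True,
                          num_workers=4, pin_memory=True)

# Workaround: run the validation loader single-process so no worker can be
# killed (signal: Terminated) when it is shut down between epochs.
val_loader = DataLoader(val_set, batch_size=1, shuffle=False, num_workers=0)

for burst, gt in val_loader:
    pass
```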