deep-rep
Training crashed
Hello,
I am trying to re-train your model. I was able to pre-train it on the SyntheticSR dataset for 500k iterations and everything works fine. But when I switch to fine-tuning the model on the BurstSR dataset, the training crashes.
Here is the error log:
.......
[train: 1, 1000 / 1000] FPS: 2.4 (10.7) , Loss/total: 0.03915 , Loss/rgb: 0.03915 , Loss/raw/rgb: 0.00391 , Stat/psnr: 46.35584
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
File "/cluster/.../deep-rep/trainers/base_trainer.py", line 69, in train
self.train_epoch()
File "/cluster/.../deep-rep/trainers/simple_trainer.py", line 95, in train_epoch
self.cycle_dataset(loader)
File "/cluster/.../deep-rep/trainers/simple_trainer.py", line 66, in cycle_dataset
for i, data in enumerate(loader, 1):
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 1057, in _next_data
self._shutdown_workers()
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 1177, in _shutdown_workers
w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/popen_fork.py", line 44, in wait
if not wait([self.sentinel], timeout):
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 83465) is killed by signal: Terminated.
Restarting training from last epoch ...
....
It seems to be related to the num_workers setting in the DataLoader. Your default is settings.num_workers = 8; since I run the code on a single GPU, I reduced num_workers to 4 accordingly. The error occurs whenever num_workers is larger than 0, but setting it to 0 makes training too slow. I am confused by this behavior, since everything works fine on the SyntheticSR data.
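For reference, this is roughly how the worker count enters the picture on my side; the dataset below is just a dummy placeholder rather than the actual deep-rep BurstSR dataset class, and the variable names are mine:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy burst/target tensors standing in for the BurstSR samples;
# the real datasets come from the deep-rep dataset classes.
dummy_train = TensorDataset(torch.randn(64, 8, 4, 48, 48),
                            torch.randn(64, 3, 96, 96))

num_workers = 4  # reduced from the default settings.num_workers = 8 for my single-GPU run

train_loader = DataLoader(dummy_train,
                          batch_size=4,
                          shuffle=True,
                          num_workers=num_workers,  # any value > 0 reproduces the crash for me
                          pin_memory=True)

# The worker processes are spawned (and, in my case, eventually killed)
# while iterating the loader inside cycle_dataset().
for burst, gt in train_loader:
    pass
```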
Do you have any idea what might cause this problem? Thank you very much!
Best regards, Shijian
Update: the problem is related to the validation loader, but I am still not sure what is happening.
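In case it helps narrow things down, my current workaround is to keep workers for the training loader and disable them only for the validation loader. A minimal sketch of that idea with placeholder datasets (the real loaders are of course built elsewhere in the deep-rep training settings):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets; in the real run these are the deep-rep BurstSR
# train/val dataset objects.
train_set = TensorDataset(torch.randn(32, 4, 48, 48), torch.randn(32, 3, 96, 96))
val_set = TensorDataset(torch.randn(8, 4, 48, 48), torch.randn(8, 3, 96, 96))

# Training loader keeps multiple workers for throughput.
train_loader = DataLoader(train_set, batch_size=4, shuffle=True,
                          num_workers=4, pin_memory=True)

# Workaround: run the validation loader single-process so no worker can be
# killed (signal: Terminated) when it is shut down between epochs.
val_loader = DataLoader(val_set, batch_size=1, shuffle=False, num_workers=0)

for burst, gt in val_loader:
    pass
```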