Sudden interruption during the training process
I am training the model in a Docker container, but training was interrupted at epoch 494 with the following error.
2022-05-13 07:54:44.990512: epoch: 493
2022-05-13 07:57:23.830850: train loss : -0.9766
2022-05-13 07:57:34.664287: validation loss: -0.6276
2022-05-13 07:57:34.664937: Average global foreground Dice: [0.760401282337809]
2022-05-13 07:57:34.665017: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2022-05-13 07:57:35.181849: lr: 0.005417
2022-05-13 07:57:35.182036: This epoch took 170.191479 s
2022-05-13 07:57:35.182088: epoch: 494
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 102, in results_loop
    item = current_queue.get()
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/public/apps/anaconda3/envs/zhouzidong2/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
I suspect something is wrong with the CPU. Can you give me some advice?
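In case it helps narrow this down, here is a minimal sketch of a script that could be run inside the container to report a few resource limits that data-loading worker processes depend on. It is only a guess that any of these are related: the traceback shows a failure while receiving a tensor from a worker process, but it does not say why. The function name `collect_worker_diagnostics` and the `/dev/shm` default path are illustrative assumptions, not part of any library.

```python
import os
import resource
import shutil


def collect_worker_diagnostics(shm_path="/dev/shm"):
    """Return basic resource information for the current process (hypothetical helper)."""
    info = {"cpu_count": os.cpu_count()}

    # Soft and hard limits on the number of open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    info["nofile_soft"] = soft
    info["nofile_hard"] = hard

    # Size of the shared-memory mount, if it exists at the assumed path.
    if os.path.isdir(shm_path):
        usage = shutil.disk_usage(shm_path)
        info["shm_total_mb"] = usage.total // (1024 * 1024)
        info["shm_free_mb"] = usage.free // (1024 * 1024)
    else:
        info["shm_total_mb"] = None
        info["shm_free_mb"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_worker_diagnostics().items():
        print(f"{key}: {value}")
```

Posting the output alongside the traceback may make it easier to tell whether the problem is on the CPU side or in the container's resource limits.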