deep-learning-for-image-processing

Training an FCN network on a GPU in the cloud fails with this error: FileNotFoundError: [Errno 2] No such file or directory srun: error: gpu03: task 0: Exited with exit code 1

Open nanwang-crea opened this issue 10 months ago • 5 comments

This is the full error output. I searched online, and many posts say it is an inter-process communication problem. How can I solve it? Which parts of the code should I change?

Epoch: [0] [ 0/366] eta: 0:31:04 lr: 0.000000 loss: 2.1887 (2.1887) time: 5.0952 data: 0.7384
Epoch: [0] [ 10/366] eta: 0:15:24 lr: 0.000003 loss: 0.5890 (2.3867) time: 2.5974 data: 0.0681
Epoch: [0] [ 20/366] eta: 0:14:01 lr: 0.000006 loss: 0.2813 (1.7838) time: 2.2994 data: 0.0011
Epoch: [0] [ 30/366] eta: 0:13:21 lr: 0.000009 loss: 2.2992 (1.4588) time: 2.2671 data: 0.0010
Epoch: [0] [ 40/366] eta: 0:12:44 lr: 0.000011 loss: 1.2415 (1.4418) time: 2.2521 data: 0.0010
Epoch: [0] [ 50/366] eta: 0:12:14 lr: 0.000014 loss: 1.4934 (1.4652) time: 2.2295 data: 0.0010
Epoch: [0] [ 60/366] eta: 0:11:49 lr: 0.000017 loss: 0.5944 (1.4093) time: 2.2702 data: 0.0010
Epoch: [0] [ 70/366] eta: 0:11:23 lr: 0.000019 loss: 0.6704 (1.4132) time: 2.2722 data: 0.0010
Epoch: [0] [ 80/366] eta: 0:11:04 lr: 0.000022 loss: 0.3548 (1.3494) time: 2.3282 data: 0.0010
Epoch: [0] [ 90/366] eta: 0:10:39 lr: 0.000025 loss: 0.3015 (1.2649) time: 2.3509 data: 0.0011
Epoch: [0] [100/366] eta: 0:10:14 lr: 0.000028 loss: 0.6640 (1.2471) time: 2.2596 data: 0.0011
Epoch: [0] [110/366] eta: 0:09:51 lr: 0.000030 loss: 2.1179 (1.2050) time: 2.2716 data: 0.0010
Epoch: [0] [120/366] eta: 0:09:27 lr: 0.000033 loss: 2.0124 (1.2004) time: 2.3035 data: 0.0010
Epoch: [0] [130/366] eta: 0:09:04 lr: 0.000036 loss: 1.1753 (1.1981) time: 2.2837 data: 0.0010
Epoch: [0] [140/366] eta: 0:08:39 lr: 0.000039 loss: 2.3567 (1.2141) time: 2.2321 data: 0.0010
Epoch: [0] [150/366] eta: 0:08:18 lr: 0.000041 loss: 0.5729 (1.1973) time: 2.3115 data: 0.0010
Epoch: [0] [160/366] eta: 0:07:54 lr: 0.000044 loss: 0.4893 (1.2001) time: 2.3283 data: 0.0011
Epoch: [0] [170/366] eta: 0:07:30 lr: 0.000047 loss: 0.7241 (1.1839) time: 2.2304 data: 0.0011
Epoch: [0] [180/366] eta: 0:07:06 lr: 0.000050 loss: 1.3635 (1.1723) time: 2.2145 data: 0.0010
Traceback (most recent call last):
  File "/public/home/2023020919/FCN/train.py", line 206, in <module>
    main(args)
  File "/public/home/2023020919/FCN/train.py", line 141, in main
    mean_loss, lr = train_one_epoch(model, optimizer, train_loader, device, epoch,
  File "/public/home/2023020919/FCN/train_utils/train_and_evals.py", line 42, in train_one_epoch
    for image, target in metric_logger.log_every(data_loader, print_freq, header):
  File "/public/home/2023020919/FCN/train_utils/distrributed_utils.py", line 189, in log_every
    for obj in iterable:
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
    fd = df.detach()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
srun: error: gpu03: task 0: Exited with exit code 1
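For context on the last frames of the traceback: the exception is raised by `s.connect(address)` inside `multiprocessing/connection.py`, not by code that opens a dataset file. The sketch below uses only the standard library and assumes a Linux machine. It shows that connecting to a Unix-domain socket path that no longer exists raises the same `FileNotFoundError: [Errno 2]`. The path `listener-gone` and the function `connect_to_missing_socket` are hypothetical names for illustration only. This is not a reproduction of the training job.

```python
import os
import socket
import tempfile


def connect_to_missing_socket():
    """Connect to a Unix-domain socket path that does not exist.

    Returns the exception raised by connect(), or None if none was raised.
    """
    # Hypothetical path for illustration: nothing is listening here,
    # similar to a worker-side listener socket that has disappeared.
    missing_path = os.path.join(tempfile.mkdtemp(), "listener-gone")
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        try:
            s.connect(missing_path)
        except FileNotFoundError as exc:
            return exc
    return None


err = connect_to_missing_socket()
print(type(err).__name__, err.errno)
```

If this is what is happening here, the "missing file" would be the socket that a DataLoader worker process uses to hand tensors back to the main process. Workarounds often suggested for this pattern, neither confirmed for this cluster, are calling `torch.multiprocessing.set_sharing_strategy("file_system")` before creating the DataLoader, or setting `num_workers=0` to check whether the error disappears.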

nanwang-crea, Apr 25 '24 04:04