deep-learning-for-image-processing
Training the FCN network on a GPU in the cloud fails with this error: FileNotFoundError: [Errno 2] No such file or directory srun: error: gpu03: task 0: Exited with exit code 1
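Here is the full error output. I searched online, and many posts describe this as an inter-process communication problem. How can I solve it? Which parts of the code should I modify?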
Epoch: [0] [ 0/366] eta: 0:31:04 lr: 0.000000 loss: 2.1887 (2.1887) time: 5.0952 data: 0.7384
Epoch: [0] [ 10/366] eta: 0:15:24 lr: 0.000003 loss: 0.5890 (2.3867) time: 2.5974 data: 0.0681
Epoch: [0] [ 20/366] eta: 0:14:01 lr: 0.000006 loss: 0.2813 (1.7838) time: 2.2994 data: 0.0011
Epoch: [0] [ 30/366] eta: 0:13:21 lr: 0.000009 loss: 2.2992 (1.4588) time: 2.2671 data: 0.0010
Epoch: [0] [ 40/366] eta: 0:12:44 lr: 0.000011 loss: 1.2415 (1.4418) time: 2.2521 data: 0.0010
Epoch: [0] [ 50/366] eta: 0:12:14 lr: 0.000014 loss: 1.4934 (1.4652) time: 2.2295 data: 0.0010
Epoch: [0] [ 60/366] eta: 0:11:49 lr: 0.000017 loss: 0.5944 (1.4093) time: 2.2702 data: 0.0010
Epoch: [0] [ 70/366] eta: 0:11:23 lr: 0.000019 loss: 0.6704 (1.4132) time: 2.2722 data: 0.0010
Epoch: [0] [ 80/366] eta: 0:11:04 lr: 0.000022 loss: 0.3548 (1.3494) time: 2.3282 data: 0.0010
Epoch: [0] [ 90/366] eta: 0:10:39 lr: 0.000025 loss: 0.3015 (1.2649) time: 2.3509 data: 0.0011
Epoch: [0] [100/366] eta: 0:10:14 lr: 0.000028 loss: 0.6640 (1.2471) time: 2.2596 data: 0.0011
Epoch: [0] [110/366] eta: 0:09:51 lr: 0.000030 loss: 2.1179 (1.2050) time: 2.2716 data: 0.0010
Epoch: [0] [120/366] eta: 0:09:27 lr: 0.000033 loss: 2.0124 (1.2004) time: 2.3035 data: 0.0010
Epoch: [0] [130/366] eta: 0:09:04 lr: 0.000036 loss: 1.1753 (1.1981) time: 2.2837 data: 0.0010
Epoch: [0] [140/366] eta: 0:08:39 lr: 0.000039 loss: 2.3567 (1.2141) time: 2.2321 data: 0.0010
Epoch: [0] [150/366] eta: 0:08:18 lr: 0.000041 loss: 0.5729 (1.1973) time: 2.3115 data: 0.0010
Epoch: [0] [160/366] eta: 0:07:54 lr: 0.000044 loss: 0.4893 (1.2001) time: 2.3283 data: 0.0011
Epoch: [0] [170/366] eta: 0:07:30 lr: 0.000047 loss: 0.7241 (1.1839) time: 2.2304 data: 0.0011
Epoch: [0] [180/366] eta: 0:07:06 lr: 0.000050 loss: 1.3635 (1.1723) time: 2.2145 data: 0.0010
Traceback (most recent call last):
File "/public/home/2023020919/FCN/train.py", line 206, in