I can run train.py normally for a few epochs, and the loss keeps decreasing, but after a while training is always interrupted by the error below. It happens on every run. How can I solve this problem? Any help would be appreciated!
Traceback (most recent call last):
File "train.py", line 198, in <module>
cost = trainBatch(crnn, criterion, optimizer)
File "train.py", line 173, in trainBatch
data = train_iter.next()
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 280, in next
idx, batch = self._get_batch()
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 259, in _get_batch
return self.data_queue.get()
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/queues.py", line 378, in get
return recv()
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
return pickle.loads(buf)
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
fd = multiprocessing.reduction.rebuild_handle(df)
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/reduction.py", line 157, in rebuild_handle
new_handle = recv_handle(conn)
File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/reduction.py", line 83, in recv_handle
return _multiprocessing.recvfd(conn.fileno())
OSError: [Errno 4] Interrupted system call
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f1ebbb37f50>> ignored
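A note on reading the traceback: Errno 4 is EINTR, meaning a blocking system call (here the `recvfd` call that receives a tensor's file descriptor from a DataLoader worker) was interrupted by a signal. Python 2.7 does not retry such calls automatically; automatic retry was only added in Python 3.5. The sketch below only illustrates the general retry-on-EINTR pattern. `retry_on_eintr`, `flaky_recv`, and `max_retries` are hypothetical names made up for this example, not part of PyTorch or train.py, and nothing in this thread confirms where such a retry could safely be applied inside the DataLoader.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func, retrying only when it fails with EINTR (hypothetical helper)."""
    max_retries = kwargs.pop("max_retries", 5)
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except (OSError, IOError) as exc:
            # Re-raise anything that is not EINTR, or EINTR after the last retry.
            if exc.errno != errno.EINTR or attempt == max_retries:
                raise


# Stand-in for a blocking call that is interrupted twice before succeeding.
calls = {"n": 0}


def flaky_recv():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError(errno.EINTR, "Interrupted system call")
    return "batch"


result = retry_on_eintr(flaky_recv)
```

Wrapping `train_iter.next()` this way may not be safe, since the interrupted read may already have consumed part of the message. A simpler way to check whether the worker processes are involved is to construct the `DataLoader` with `num_workers=0`, which loads data in the main process and does not go through the multiprocessing queue shown in the traceback.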
I ran into the same issue. Did you find out where the problem is?
Me too. Did you solve it? @encounter1997 @fujingling