Training got stuck due to timeout from dataloader
The same training script worked well with PyTorch 1.4. I'm trying to test some new stuff on the master branch (built from source), but training always gets stuck after a few hundred iterations without triggering any error message. If I Ctrl-C it, the traceback ends in a timeout function in the DataLoader. Again, the same training code and configuration worked well with 1.4. Any clue?
Iteration 198: train = 1.8057, g_train = 0.9733, t_train = 0.8221, kl_train = 1.02579618
Iteration 199: train = 0.9988, g_train = 0.2920, t_train = 0.6974, kl_train = 0.93473649
Iteration 200: train = 1.3745, g_train = 0.4477, t_train = 0.9169, kl_train = 0.99940717
saved tex images for 200
Iteration 201: train = 1.1959, g_train = 0.3795, t_train = 0.8027, kl_train = 1.37421489
^CTraceback (most recent call last):
  ......
  ......
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
- PyTorch Version (e.g., 1.0): 1.5.0a0+ab14375
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, source): source
- Build command you used (if compiling from source): USE_MPI=OFF python setup.py install
- Python version: 3.6
- CUDA/cuDNN version: 9.0
Thanks in advance!
cc @SsnL @VitalyFedyunin @ngimel
It would be nice to know the model you are using. Also, is it multi- or single-GPU training? Please provide as many details as possible, as we cannot reproduce it right now.
I met the same problem; I use a single GPU.
I met the same problem as well. It's likely stuck on an OpenCV resize; even with cv2.setNumThreads(0) it still hangs.
The stack trace looks like the following:
File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 160, in _worker_loop r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/queues.py", line 104, in get if not self._poll(timeout): File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 257, in poll return self._poll(timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 414, in _poll r = wait([self], timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 911, in wait ready = selector.select(timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/selectors.py", line 376, in select fd_event_list = self._poll.poll(timeout)
I'm sorry to say the problem still persists. It is alleviated when the number of workers is low, but if the worker count is high (say 16), the DataLoader hangs again. It might have something to do with system resources.
I met the problem without using opencv.
Is this problem solved when you don't use OpenCV?!
A similar problem happened to me, even though cv2 was merely imported in the dataloader. Setting num_workers=0 or commenting out 'import cv2' solves the problem. PyTorch 1.4, Python 3.8, OpenCV 3.4.2 (built from source).
- There are some fixes for CUDA IPC coming in the 1.6 release
- I plan to make timeout errors more explicit so we can see the root cause
Hi, we have the same issue; the error log is below. We are not using OpenCV. We established a connection to Azure Data Lake to fetch our training data, so is it possible the connection somehow closed, and if so, how should we re-establish it?
Or is it because we have overused shared memory?
Thanks, Rui
Traceback (most recent call last):
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataload er.py", line 761, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 108, in g et
res = self._recv_bytes()
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 411, in _recv_bytes
return self._recv(size)
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/_utils/s ignal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 51897) is killed by signal: Killed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "performance.py", line 71, in <module>
main()
File "performance.py", line 58, in main
for cnt, batch in enumerate(rl_data_loader):
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataload er.py", line 345, in __next__
data = self._next_data()
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataload er.py", line 841, in _next_data
idx, data = self._get_data()
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataload er.py", line 808, in _get_data
success, data = self._try_get_data()
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataload er.py", line 774, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.forma t(pids_str))
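As an aside, "DataLoader worker (pid ...) is killed by signal: Killed" usually means the worker was terminated by the kernel's OOM killer, so logging host memory and /dev/shm usage from the training script can help narrow down whether memory or shared memory is being exhausted. A minimal standard-library sketch (the paths assume Linux):

import shutil

def log_memory_pressure():
    # Shared-memory segments used to pass batches between DataLoader
    # workers typically live under /dev/shm on Linux.
    shm = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm: {shm.used / 2**30:.2f} GiB used of {shm.total / 2**30:.2f} GiB")

    # Overall memory picture from /proc/meminfo.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(("MemTotal", "MemAvailable")):
                print(line.strip())

log_memory_pressure()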
At the beginning of the program, first import cv2, and then import torch.
import cv2
import torch
It works for me.
My torch version is 1.6.9a+c790476.
@YuechengLi @daixiangzi @weidezhang
I have the same problem without OpenCV.
I have checked the source code for the DataLoader; basically, we could set a timeout:
def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
    # Tries to fetch data from `self._data_queue` once for a given timeout.
    # This can also be used as inner loop of fetching without timeout, with
    # the sender status as the loop condition.
    #
    # This raises a `RuntimeError` if any worker died expectedly. This error
    # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
    # (only for non-Windows platforms), or the manual check below on errors
    # and timeouts.
    #
    # Returns a 2-tuple:
    #   (bool: whether successfully get data, any: data if successful else None)
    try:
        data = self._data_queue.get(timeout=timeout)
The default is 5 seconds:
MP_STATUS_CHECK_INTERVAL = 5.0
Interval (in seconds) to check status of processes to avoid hanging in
multiprocessing data loading. This is mainly used in getting data from
another process, in which case we need to periodically check whether the
sender is alive to prevent hanging.
What does this interval mean? I tried increasing the timeout in the multi-worker DataLoader definition, but I get the same error. The only difference is that the final error message becomes something like:
RuntimeError: DataLoader worker (pid 320) is killed by signal: Segmentation fault.
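For clarity, MP_STATUS_CHECK_INTERVAL is only the internal polling interval the main process uses to check that workers are still alive; the user-facing knob is the timeout argument of DataLoader itself, which raises an error if no batch arrives within that many seconds instead of hanging forever. A minimal sketch with a toy dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

# timeout=120 makes the loader raise a RuntimeError if a batch takes
# longer than 120 s to arrive from the workers, rather than hanging.
loader = DataLoader(dataset, batch_size=32, num_workers=4, timeout=120)

for batch in loader:
    pass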
@zhangruiskyline I met the same issue with PyTorch 1.7.1. Have you solved it?
I met the same issue with torch==1.6.0+cu101. Have you solved it?
Same here, no GPUs, only CPU, and no OpenCV. Torch version 1.10.1. I've updated all my packages, including conda and Jupyter. I thought it was the notebook having issues, so I produced the .py version of the code... no luck! The problem persists as described above.
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py
or
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
may help
Leaving MKL just one thread would wipe out its multithreaded optimizations, which are a huge speed-up.
Update: I ran the code with num_workers=0, left the NUM_THREADS variables at their defaults, and it completed the job without unexpected interruptions. Perhaps it was just a lucky run, but having tried more than 10 times without success before, that seems to have been the right parameter to set.
cheers
For me the issue was apparently in my augmentations. In Albumentations there are some augmentations that can loop infinitely, like RandomFog. I was only able to see where the code froze when I set num_workers=0.
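In the same spirit, a quick way to localize a freezing transform is to iterate the Dataset directly in the main process (bypassing the DataLoader workers) and print progress. This is just a debugging sketch, with my_dataset standing in for whatever Dataset feeds the DataLoader:

import time

def scan_dataset(dataset, report_every=100):
    # Iterate samples in the main process so a hang or exception inside a
    # transform is visible immediately instead of being hidden in a worker.
    start = time.time()
    for i in range(len(dataset)):
        _ = dataset[i]
        if i % report_every == 0:
            print(f"sample {i} ok ({time.time() - start:.1f}s elapsed)")

# scan_dataset(my_dataset)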
I am having this problem and cannot find a reason for it: it fails more or less half of the time. It either freezes after the first complete epoch or never.
Getting this issue any time I set num_workers > 0 in one of my projects. Nothing involving OpenCV.
Same issue here. It freezes for a higher number of workers and gets stuck after one epoch at the same place as the OP (self._poll.poll() in selectors.py).
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py
Doesn't solve the issue either.
Using:
- pytorch 1.11.0+cu113
- pytorch_lightning 1.6.5
I also ran into this problem: training gets stuck in the DataLoader and then raises this error.
I also face this issue. @weidezhang @zhangruiskyline @pbelevich @VitalyFedyunin @YuechengLi Is the only workaround setting num_workers=0?
In my case, it was due to num_workers being set too high in torch.utils.data.DataLoader().
I got the same issue when setting num_workers > 0 while using mmdetection. It is fine if num_workers is set to 0.
Same issue. I don't know why, but removing torch.multiprocessing.set_sharing_strategy("file_system") solved the problem. It might be something related to the platform I'm using.
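For anyone who wants to experiment with this, the sharing strategy can be inspected and switched at the top of the training script. This is only a sketch; "file_descriptor" is the default on Linux:

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'}
print(mp.get_sharing_strategy())        # strategy currently in use

# "file_system" keeps shared-memory files in /dev/shm and relies on a
# cleanup process; "file_descriptor" passes file descriptors instead.
mp.set_sharing_strategy("file_descriptor")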
Got the same problem, and it was due to num_workers. The code is not mine, so I don't know the details; however, setting num_workers=0 fixed the problem.
My goodness, still the same issue with torch 2.4 and Python 3.10, hanging on the exact same line
data = self._data_queue.get(timeout=timeout)
when num_workers > 0 (I used 4), either around the first epoch or never, with the default 5 s timeout
MP_STATUS_CHECK_INTERVAL = 5.0
My problem was solved by following https://github.com/Lightning-AI/pytorch-lightning/issues/18149#issuecomment-1834540962. Anyone who meets this problem should also try setting multiprocessing_context='spawn' with num_workers > 0.
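A minimal sketch of that workaround is below (toy dataset for illustration); note that with "spawn" every worker starts in a fresh interpreter, so the dataset, transforms, and any collate_fn must be picklable:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    multiprocessing_context="spawn",  # sidestep fork-related deadlocks
    persistent_workers=True,          # amortize the spawn start-up cost across epochs
)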
@xijiu9 I got some kind of pickle error when I tried multiprocessing_context='spawn' or 'forkserver':
(...)
File "/home/████████████████/Downloads/mup-vit/main.py", line 459, in infinite_loader
yield from train_loader
File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 440, in __iter__
return self._get_iterator()
File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1038, in __init__
w.start()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/local/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
return Popen(process_obj)
File "/usr/local/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/local/lib/python3.10/multiprocessing/popen_forkserver.py", line 47, in _launch
reduction.dump(process_obj, buf)
File "/usr/local/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main_worker.<locals>.<lambda>'
Not sure what the cause is 😕
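That AttributeError means something handed to the DataLoader (often a transform or collate_fn defined as a lambda inside main_worker) cannot be pickled, which "spawn" and "forkserver" require. Replacing the lambda with a module-level function or functools.partial usually resolves it; a minimal sketch with hypothetical names:

from functools import partial

# Not picklable under spawn/forkserver: a lambda defined inside a function,
# e.g. collate = lambda batch: my_collate(batch, pad_value=0)

# Picklable alternative: a module-level function, bound with partial.
def my_collate(batch, pad_value=0):
    ...

collate = partial(my_collate, pad_value=0)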