Training got stuck due to timeout from dataloader

Open YuechengLi opened this issue 5 years ago • 32 comments

The same training script worked well with PyTorch 1.4. I'm trying to test some new stuff from the master branch (built from source), but training always gets stuck after a few hundred iterations without triggering any error info. If I Ctrl-C it, the traceback ends in a timeout function in the dataloader. Again, the same training code and configuration worked well with 1.4. Any clue?

Iteration 198: train = 1.8057, g_train = 0.9733, t_train = 0.8221, kl_train = 1.02579618
Iteration 199: train = 0.9988, g_train = 0.2920, t_train = 0.6974, kl_train = 0.93473649
Iteration 200: train = 1.3745, g_train = 0.4477, t_train = 0.9169, kl_train = 0.99940717
saved tex images for 200
Iteration 201: train = 1.1959, g_train = 0.3795, t_train = 0.8027, kl_train = 1.37421489
^CTraceback (most recent call last):
  ......
  ......
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt

  • PyTorch Version (e.g., 1.0): 1.5.0a0+ab14375
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source): USE_MPI=OFF python setup.py install
  • Python version: 3.6
  • CUDA/cuDNN version: 9.0

Thanks in advance!

cc @SsnL @VitalyFedyunin @ngimel

YuechengLi avatar Feb 13 '20 16:02 YuechengLi

It would be nice to know which model you are using. Also, is it multi-GPU or single-GPU training? Please give as many details as possible, as we cannot reproduce it right now.

VitalyFedyunin avatar Feb 19 '20 20:02 VitalyFedyunin

I met the same problem; I'm using a single GPU.

daixiangzi avatar Apr 30 '20 01:04 daixiangzi

I met the same problem as well. It's likely stuck on the OpenCV resize. Even with cv2.setNumThreads(0) it still hangs.

the stack trace looks like the following:

File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 160, in _worker_loop r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/queues.py", line 104, in get if not self._poll(timeout): File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 257, in poll return self._poll(timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 414, in _poll r = wait([self], timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 911, in wait ready = selector.select(timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/selectors.py", line 376, in select fd_event_list = self._poll.poll(timeout)

weidezhang avatar May 20 '20 15:05 weidezhang

I'm sorry to say the problem still persists. It eases when the worker count is low, but with a high worker count (say 16) the data loader hangs again. It might have something to do with system resources.

weidezhang avatar May 28 '20 20:05 weidezhang

I met the problem without using opencv.

psu1 avatar May 29 '20 04:05 psu1

So is this problem solved when you don't use OpenCV?!


daixiangzi avatar May 29 '20 06:05 daixiangzi

A similar problem happened to me, even though cv2 was only imported in the dataloader. Setting num_workers=0 or commenting out 'import cv2' solved the problem. PyTorch 1.4, Python 3.8, OpenCV 3.4.2 (built from source).
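
For reference, a minimal sketch of the num_workers=0 workaround that recurs in this thread: with zero workers the batches are produced in the main process, so the multiprocessing queue (and its timeout) is never involved, and any hang shows up as an ordinary traceback. TensorDataset is used here only to keep the snippet self-contained:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.zeros(1000, dtype=torch.long))

# num_workers=0: slower, but it bypasses the worker processes entirely.
loader = DataLoader(dataset, batch_size=32, num_workers=0)

for images, labels in loader:
    pass  # training step goes here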

lizhongguo avatar Jul 07 '20 07:07 lizhongguo

  1. There are some fixes for CUDA IPC coming in the 1.6 release.
  2. I plan to make timeout errors more explicit so we can see what the root cause is.

VitalyFedyunin avatar Jul 15 '20 01:07 VitalyFedyunin

Hi, we have the same issue; the error log is below. We are not using OpenCV. We establish a connection to Azure Data Lake to fetch our training data, so is it possible the connection somehow closed, and if so, how should we re-establish it?

Or is it because we have overused shared memory?

Thanks, Rui

Traceback (most recent call last):
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 108, in get
    res = self._recv_bytes()
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 411, in _recv_bytes
    return self._recv(size)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 51897) is killed by signal: Killed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "performance.py", line 71, in <module>
    main()
  File "performance.py", line 58, in main
    for cnt, batch in enumerate(rl_data_loader):
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
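
Regarding re-establishing the connection: a common pattern is to open the data-lake connection lazily inside each worker process instead of in the parent, since a connection created before the workers are forked may not survive in the children. A minimal sketch under that assumption; make_datalake_client, _FakeClient and client.read() are hypothetical placeholders, not an actual Azure SDK API:

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class _FakeClient:
    """Stand-in for the real Azure Data Lake client (hypothetical API)."""
    def read(self, key):
        return bytes(16)  # pretend we downloaded 16 bytes for this key

def make_datalake_client():
    # Placeholder factory: in real code, build and return the Azure client here.
    return _FakeClient()

class LakeDataset(Dataset):
    def __init__(self, keys):
        self.keys = keys
        self.client = None  # do NOT connect in the parent process

    def _ensure_client(self):
        # Each worker process creates (and owns) its own connection.
        if self.client is None:
            self.client = make_datalake_client()

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        self._ensure_client()
        blob = self.client.read(self.keys[idx])
        return torch.from_numpy(np.frombuffer(blob, dtype=np.uint8).copy())

loader = DataLoader(LakeDataset(keys=["a", "b", "c"]), batch_size=1, num_workers=2)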

zhangruiskyline avatar Aug 20 '20 22:08 zhangruiskyline

At the beginning of the program, first import cv2, and then import torch.

import cv2
import torch

It works for me.

My torch version is 1.6.9a+c790476.

@YuechengLi @daixiangzi @weidezhang

raymon-tian avatar Sep 28 '20 12:09 raymon-tian

I have the same problem without OpenCV.

ZhiyuanDang avatar Oct 29 '20 03:10 ZhiyuanDang

I have checked the source code for the dataloader; basically we could set a timeout:

   def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        # Tries to fetch data from `self._data_queue` once for a given timeout.
        # This can also be used as inner loop of fetching without timeout, with
        # the sender status as the loop condition.
        #
        # This raises a `RuntimeError` if any worker died expectedly. This error
        # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
        # (only for non-Windows platforms), or the manual check below on errors
        # and timeouts.
        #
        # Returns a 2-tuple:
        #   (bool: whether successfully get data, any: data if successful else None)
        try:
            data = self._data_queue.get(timeout=timeout)

The default is 5 seconds:

MP_STATUS_CHECK_INTERVAL = 5.0
Interval (in seconds) to check status of processes to avoid hanging in
    multiprocessing data loading. This is mainly used in getting data from
    another process, in which case we need to periodically check whether the
    sender is alive to prevent hanging.

What does this interval mean? I tried increasing the timeout in the multi-worker DataLoader definition, but I get the same error; the only difference is that the final error message becomes:

RuntimeError: DataLoader worker (pid 320) is killed by signal: Segmentation fault.
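
To clarify the two values being discussed here: MP_STATUS_CHECK_INTERVAL is only the internal polling interval the loader uses to wake up and check that its worker processes are still alive; it does not abort anything by itself. The user-facing knob is the DataLoader timeout argument (default 0, i.e. wait forever), which raises an error if no batch arrives within that many seconds. A minimal sketch, with TensorDataset used just to keep it self-contained:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 8))

# timeout is in seconds and applies to collecting one batch from the workers.
# With the default timeout=0 the loader waits indefinitely, waking up every
# MP_STATUS_CHECK_INTERVAL (5 s) only to verify that the workers are still alive.
loader = DataLoader(dataset, batch_size=16, num_workers=2, timeout=60)

for (batch,) in loader:
    pass  # raises a RuntimeError if a batch takes longer than 60 s to arrive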

zhangruiskyline avatar Oct 30 '20 03:10 zhangruiskyline

@zhangruiskyline I met the same issue with PyTorch 1.7.1. Have you solved it?

lingcong-k avatar Jun 30 '21 16:06 lingcong-k

I met the same issue with torch==1.6.0+cu101. Have you solved it?

eltonfernando avatar Nov 03 '21 16:11 eltonfernando

Same here, no GPUs, only CPU, and no OpenCV. Torch version 1.10.1. I've updated all my packages, including conda and jupyter. I thought it was the notebook having issues, so I produced a .py version of the code; no luck! The problem persists as described above.

walteriviera avatar Jan 16 '22 20:01 walteriviera

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py

or

import os
os.environ["OMP_NUM_THREADS"] = "1" 
os.environ["MKL_NUM_THREADS"] = "1" 

may help
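
One caveat worth adding: these environment variables are generally only honored if they are set before the libraries that read them (torch, numpy, the OpenMP/MKL runtimes) are imported, so the in-code variant above belongs at the very top of the script. A small sketch of the same idea, also using PyTorch's own intra-op thread setting:

import os
os.environ["OMP_NUM_THREADS"] = "1"  # set before importing torch/numpy
os.environ["MKL_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)  # PyTorch's own knob for intra-op CPU threads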

Devoe-97 avatar Jan 17 '22 04:01 Devoe-97

Leaving MKL only 1 thread would wipe out its multi-threaded optimizations, which are a huge speed-up.

Update: I ran the code with num_workers=0, left the NUM_THREADS values at their defaults, and it completed the job without unexpected interruptions. Perhaps it was just a lucky run, but having tried more than 10 times without success before, that seems to have been the right parameter to set.

cheers

walteriviera avatar Jan 17 '22 09:01 walteriviera

For me the issue was apparently in my augmentations. In albumentations there are some augmentations that can loop forever, like RandomFog. I was only able to see where the code froze once I set num_workers=0.
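
A small debugging sketch along the same lines: run with num_workers=0 and wrap the transform with a timer so slow augmentations (and the offending sample index) get logged; for a true infinite loop, num_workers=0 also lets Ctrl-C show the exact line. TimedTransformDataset and its arguments are placeholders, not an albumentations API:

import time
from torch.utils.data import Dataset

class TimedTransformDataset(Dataset):
    """Wraps a dataset and logs transform calls that take suspiciously long."""
    def __init__(self, base_dataset, transform, warn_after=2.0):
        self.base_dataset = base_dataset
        self.transform = transform
        self.warn_after = warn_after  # seconds

    def __len__(self):
        return len(self.base_dataset)

    def __getitem__(self, idx):
        sample = self.base_dataset[idx]
        start = time.monotonic()
        out = self.transform(sample)
        elapsed = time.monotonic() - start
        if elapsed > self.warn_after:
            print(f"transform took {elapsed:.1f}s on sample {idx}")
        return out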

opeide avatar May 25 '22 10:05 opeide

I am having this problem and cannot find a reason for it: it fails roughly half the time. It freezes after the first complete epoch, or never.

thistlillo avatar Jul 20 '22 12:07 thistlillo

Getting this issue any time I set num_workers > 0 in one of my projects. Nothing involving OpenCV.

Ryul0rd avatar Oct 18 '22 23:10 Ryul0rd

Same issue here. It freezes with a higher number of workers and gets stuck after 1 epoch at the same place as the OP (self._poll.poll() in selectors.py).

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py

Doesn't solve the issue either.

Using:

  • pytorch 1.11.0+cu113
  • pytorch_lightning 1.6.5

aumillera avatar Nov 18 '22 08:11 aumillera

I also ran into this problem: training gets stuck in the dataloader and then raises this error.

liyunlongaaa avatar Nov 30 '22 11:11 liyunlongaaa

I also face this issue @weidezhang @zhangruiskyline @pbelevich @VitalyFedyunin @YuechengLi. Is the only workaround setting num_workers=0?

jaideep11061982 avatar Jan 11 '23 06:01 jaideep11061982

In my case it was due to num_workers being set too high in torch.utils.data.DataLoader().

forever208 avatar Feb 24 '23 22:02 forever208

I got the same issue when setting num_workers > 0 while using mmdetection. It is fine if num_workers is set to 0.

wzhings avatar Jul 13 '23 17:07 wzhings

Same issue. I don't know why, but removing torch.multiprocessing.set_sharing_strategy("file_system") solved the problem. It might be something to do with the platform I'm using.
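
For anyone experimenting with this, a minimal sketch of inspecting and switching the tensor sharing strategy (all three functions live in torch.multiprocessing; which strategy behaves best is platform-dependent):

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'} on Linux
print(mp.get_sharing_strategy())        # the strategy currently in use

# 'file_descriptor' is the Linux default; 'file_system' avoids open-file-descriptor
# limits but can leave shared-memory files behind if workers die uncleanly.
mp.set_sharing_strategy("file_descriptor")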

maciejmajek avatar Sep 20 '23 08:09 maciejmajek

Got the same problem, and it was due to num_workers. The code is not mine so I don't know the details, but setting num_workers=0 fixed the problem.

joaolcguerreiro avatar May 15 '24 21:05 joaolcguerreiro

My goodness, still the same issue with torch 2.4, Python 3.10, hanging on the exact same line

data = self._data_queue.get(timeout=timeout)

when num_workers > 0 (I used 4), either around the first epoch or never, with the default 5 s check interval

MP_STATUS_CHECK_INTERVAL = 5.0

EIFY avatar Aug 16 '24 23:08 EIFY

My problem was solved by following https://github.com/Lightning-AI/pytorch-lightning/issues/18149#issuecomment-1834540962. Anyone who runs into this problem could also try setting multiprocessing_context='spawn' with num_workers > 0.
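
A minimal sketch of that suggestion, passing the spawn context directly to the DataLoader. Note that with spawn the dataset, collate_fn and worker_init_fn must all be picklable, which is exactly what trips up the lambda in the next comment:

import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader():
    dataset = TensorDataset(torch.randn(512, 16))
    # 'spawn' starts each worker as a fresh interpreter instead of forking the
    # parent, so workers do not inherit locks or threads that can deadlock.
    return DataLoader(dataset, batch_size=64, num_workers=4,
                      multiprocessing_context="spawn", persistent_workers=True)

if __name__ == "__main__":  # required with spawn so children can re-import the module safely
    for (batch,) in build_loader():
        pass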

haochengxi avatar Aug 20 '24 17:08 haochengxi

@xijiu9 I got some kind of pickle error when I tried multiprocessing_context='spawn' or 'forkserver':

  (...)
  File "/home/████████████████/Downloads/mup-vit/main.py", line 459, in infinite_loader
    yield from train_loader
  File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 440, in __iter__
    return self._get_iterator()
  File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1038, in __init__
    w.start()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/usr/local/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main_worker.<locals>.<lambda>'

Not sure what the cause is 😕
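
That AttributeError is the usual spawn/forkserver limitation: workers start as fresh interpreters, so everything handed to the DataLoader (dataset, collate_fn, worker_init_fn, ...) gets pickled, and a lambda defined inside main_worker cannot be. The usual fix is to move the callable to module level, or wrap a module-level function with functools.partial if it needs captured arguments. A sketch of that idea; RandomVectors, scale_batch and the 0.5 factor are made-up stand-ins for whatever the lambda did:

from functools import partial

import torch
from torch.utils.data import DataLoader, Dataset

class RandomVectors(Dataset):
    """Tiny stand-in dataset so the example is self-contained."""
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        return torch.randn(8)

def scale_batch(batch, factor):
    # Module-level function: picklable, unlike a lambda defined inside another function.
    return torch.stack(batch) * factor

def main_worker():
    return DataLoader(
        RandomVectors(),
        batch_size=32,
        num_workers=4,
        multiprocessing_context="spawn",
        # partial of a top-level function pickles fine; a local lambda would not.
        collate_fn=partial(scale_batch, factor=0.5),
    )

if __name__ == "__main__":
    for batch in main_worker():
        print(batch.shape)  # torch.Size([32, 8]); no pickling error with spawn
        break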

EIFY avatar Aug 24 '24 05:08 EIFY