Training got stuck due to timeout from dataloader

Open YuechengLi opened this issue 5 years ago • 32 comments

The same training script worked well with PyTorch 1.4. I'm trying to test some new stuff from the master branch (built from source), but training always gets stuck after a few hundred iterations without triggering any error info. If I Ctrl-C it, the traceback ends in a timeout function in the dataloader. Again, the same training code and configuration worked well with 1.4. Any clue?

Iteration 198: train = 1.8057, g_train = 0.9733, t_train = 0.8221, kl_train = 1.02579618
Iteration 199: train = 0.9988, g_train = 0.2920, t_train = 0.6974, kl_train = 0.93473649
Iteration 200: train = 1.3745, g_train = 0.4477, t_train = 0.9169, kl_train = 0.99940717
saved tex images for 200
Iteration 201: train = 1.1959, g_train = 0.3795, t_train = 0.8027, kl_train = 1.37421489
^CTraceback (most recent call last):
  ......
  ......
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/mnt/home/xxxx/anaconda3/envs/pytorch-py36/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt

  • PyTorch Version (e.g., 1.0): 1.5.0a0+ab14375
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source): USE_MPI=OFF python setup.py install
  • Python version: 3.6
  • CUDA/cuDNN version: 9.0

Thanks in advance!

cc @SsnL @VitalyFedyunin @ngimel

YuechengLi avatar Feb 13 '20 16:02 YuechengLi

It would be nice to know which model you are using. Also, is it multi-GPU or single-GPU training? Please give as many details as possible, as we cannot reproduce it right now.

VitalyFedyunin avatar Feb 19 '20 20:02 VitalyFedyunin

I met the same problem; I'm using a single GPU.

daixiangzi avatar Apr 30 '20 01:04 daixiangzi

I met the same problem as well. It's likely stuck on the OpenCV resize. Even with cv2.setNumThreads(0) it still hangs.

the stack trace looks like the following:

File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 160, in _worker_loop r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/queues.py", line 104, in get if not self._poll(timeout): File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 257, in poll return self._poll(timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 414, in _poll r = wait([self], timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/multiprocessing/connection.py", line 911, in wait ready = selector.select(timeout) File "/home/weide1/anaconda3/envs/fuel-py36/lib/python3.6/selectors.py", line 376, in select fd_event_list = self._poll.poll(timeout)

weidezhang avatar May 20 '20 15:05 weidezhang

I'm sorry to say the problem still persists. It eases when the worker count is low, but with a high worker count (say 16) the data loader hangs again. It might have something to do with system resources.

weidezhang avatar May 28 '20 20:05 weidezhang

I met the problem without using opencv.

psu1 avatar May 29 '20 04:05 psu1

So is this problem solved when you don't use OpenCV?!


daixiangzi avatar May 29 '20 06:05 daixiangzi

A similar problem happened to me, even though cv2 was only imported in the dataloader. Setting num_workers=0 or commenting out 'import cv2' solved the problem. PyTorch 1.4, Python 3.8, OpenCV 3.4.2 (built from source).
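
For reference, a minimal sketch of the num_workers=0 workaround that recurs in this thread: with zero workers the batches are produced in the main process, so the multiprocessing queue (and its timeout) is never involved, and any hang shows up as an ordinary traceback. TensorDataset is used here only to keep the snippet self-contained:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.zeros(1000, dtype=torch.long))

# num_workers=0: slower, but it bypasses the worker processes entirely.
loader = DataLoader(dataset, batch_size=32, num_workers=0)

for images, labels in loader:
    pass  # training step goes here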

lizhongguo avatar Jul 07 '20 07:07 lizhongguo

  1. There are some fixes for CUDA IPC coming in the 1.6 release.
  2. I plan to make timeout errors more explicit so we can see what the root cause is.

VitalyFedyunin avatar Jul 15 '20 01:07 VitalyFedyunin

Hi, we have the same issue; the error log is below. We are not using OpenCV. We establish a connection to Azure Data Lake to fetch our training data, so is it possible the connection somehow closed, and if so, how should we re-establish it?

Or is it because we have overused shared memory?

Thanks, Rui

Traceback (most recent call last):
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 108, in get
    res = self._recv_bytes()
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 411, in _recv_bytes
    return self._recv(size)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 51897) is killed by signal: Killed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "performance.py", line 71, in <module>
    main()
  File "performance.py", line 58, in main
    for cnt, batch in enumerate(rl_data_loader):
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
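
Regarding re-establishing the connection: a common pattern is to open the data-lake connection lazily inside each worker process instead of in the parent, since a connection created before the workers are forked may not survive in the children. A minimal sketch under that assumption; make_datalake_client, _FakeClient and client.read() are hypothetical placeholders, not an actual Azure SDK API:

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class _FakeClient:
    """Stand-in for the real Azure Data Lake client (hypothetical API)."""
    def read(self, key):
        return bytes(16)  # pretend we downloaded 16 bytes for this key

def make_datalake_client():
    # Placeholder factory: in real code, build and return the Azure client here.
    return _FakeClient()

class LakeDataset(Dataset):
    def __init__(self, keys):
        self.keys = keys
        self.client = None  # do NOT connect in the parent process

    def _ensure_client(self):
        # Each worker process creates (and owns) its own connection.
        if self.client is None:
            self.client = make_datalake_client()

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        self._ensure_client()
        blob = self.client.read(self.keys[idx])
        return torch.from_numpy(np.frombuffer(blob, dtype=np.uint8).copy())

loader = DataLoader(LakeDataset(keys=["a", "b", "c"]), batch_size=1, num_workers=2)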

zhangruiskyline avatar Aug 20 '20 22:08 zhangruiskyline

At the beginning of the program, first import cv2, and then import torch.

import cv2
import torch

It works for me.

My torch version is 1.6.9a+c790476.

@YuechengLi @daixiangzi @weidezhang

raymon-tian avatar Sep 28 '20 12:09 raymon-tian

I have the same problem without OpenCV.

ZhiyuanDang avatar Oct 29 '20 03:10 ZhiyuanDang

I have checked the source code for the dataloader; basically we could set a timeout:

   def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        # Tries to fetch data from `self._data_queue` once for a given timeout.
        # This can also be used as inner loop of fetching without timeout, with
        # the sender status as the loop condition.
        #
        # This raises a `RuntimeError` if any worker died expectedly. This error
        # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
        # (only for non-Windows platforms), or the manual check below on errors
        # and timeouts.
        #
        # Returns a 2-tuple:
        #   (bool: whether successfully get data, any: data if successful else None)
        try:
            data = self._data_queue.get(timeout=timeout)

The default is 5 seconds:

MP_STATUS_CHECK_INTERVAL = 5.0
Interval (in seconds) to check status of processes to avoid hanging in
    multiprocessing data loading. This is mainly used in getting data from
    another process, in which case we need to periodically check whether the
    sender is alive to prevent hanging.

What does this interval mean? I tried increasing the timeout in the multi-worker DataLoader definition, but I get the same error; the only difference is that the final error message becomes:

RuntimeError: DataLoader worker (pid 320) is killed by signal: Segmentation fault.
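
To clarify the two values being discussed here: MP_STATUS_CHECK_INTERVAL is only the internal polling interval the loader uses to wake up and check that its worker processes are still alive; it does not abort anything by itself. The user-facing knob is the DataLoader timeout argument (default 0, i.e. wait forever), which raises an error if no batch arrives within that many seconds. A minimal sketch, with TensorDataset used just to keep it self-contained:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 8))

# timeout is in seconds and applies to collecting one batch from the workers.
# With the default timeout=0 the loader waits indefinitely, waking up every
# MP_STATUS_CHECK_INTERVAL (5 s) only to verify that the workers are still alive.
loader = DataLoader(dataset, batch_size=16, num_workers=2, timeout=60)

for (batch,) in loader:
    pass  # raises a RuntimeError if a batch takes longer than 60 s to arrive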

zhangruiskyline avatar Oct 30 '20 03:10 zhangruiskyline

@zhangruiskyline I met the same issue with PyTorch 1.7.1. Have you solved it?

lingcong-k avatar Jun 30 '21 16:06 lingcong-k

I met the same issue with torch==1.6.0+cu101. Have you solved it?

eltonfernando avatar Nov 03 '21 16:11 eltonfernando

Same here, no GPUs, only CPU, and no OpenCV. Torch version 1.10.1. I've updated all my packages, including conda and jupyter. I thought it was the notebook having issues, so I produced a .py version of the code; no luck! The problem persists as described above.

walteriviera avatar Jan 16 '22 20:01 walteriviera

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py

or

import os
os.environ["OMP_NUM_THREADS"] = "1" 
os.environ["MKL_NUM_THREADS"] = "1" 

may help
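
One caveat worth adding: these environment variables are generally only honored if they are set before the libraries that read them (torch, numpy, the OpenMP/MKL runtimes) are imported, so the in-code variant above belongs at the very top of the script. A small sketch of the same idea, also using PyTorch's own intra-op thread setting:

import os
os.environ["OMP_NUM_THREADS"] = "1"  # set before importing torch/numpy
os.environ["MKL_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)  # PyTorch's own knob for intra-op CPU threads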

Devoe-97 avatar Jan 17 '22 04:01 Devoe-97

Leaving MKL only 1 thread would wipe out its multi-threaded optimizations, which are a huge speed-up.

Update: I ran the code with num_workers=0, left the NUM_THREADS values at their defaults, and it completed the job without unexpected interruptions. Perhaps it was just a lucky run, but having tried more than 10 times without success before, that seems to have been the right parameter to set.

cheers

walteriviera avatar Jan 17 '22 09:01 walteriviera

For me the issue was apparently in my augmentations. In albumentations there are some augmentations that can loop forever, like RandomFog. I was only able to see where the code froze once I set num_workers=0.
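
A small debugging sketch along the same lines: run with num_workers=0 and wrap the transform with a timer so slow augmentations (and the offending sample index) get logged; for a true infinite loop, num_workers=0 also lets Ctrl-C show the exact line. TimedTransformDataset and its arguments are placeholders, not an albumentations API:

import time
from torch.utils.data import Dataset

class TimedTransformDataset(Dataset):
    """Wraps a dataset and logs transform calls that take suspiciously long."""
    def __init__(self, base_dataset, transform, warn_after=2.0):
        self.base_dataset = base_dataset
        self.transform = transform
        self.warn_after = warn_after  # seconds

    def __len__(self):
        return len(self.base_dataset)

    def __getitem__(self, idx):
        sample = self.base_dataset[idx]
        start = time.monotonic()
        out = self.transform(sample)
        elapsed = time.monotonic() - start
        if elapsed > self.warn_after:
            print(f"transform took {elapsed:.1f}s on sample {idx}")
        return out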

opeide avatar May 25 '22 10:05 opeide

I am having this problem and cannot find a reason for it: it fails roughly half the time. It freezes after the first complete epoch, or never.

thistlillo avatar Jul 20 '22 12:07 thistlillo

Getting this issue any time I set num_workers > 0 in one of my projects. Nothing involving OpenCV.

Ryul0rd avatar Oct 18 '22 23:10 Ryul0rd

Same issue here. It freezes with a higher number of workers and gets stuck after 1 epoch at the same place as the OP (self._poll.poll() in selectors.py).

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py

Doesn't solve the issue either.

Using:

  • pytorch 1.11.0+cu113
  • pytorch_lightning 1.6.5

aumillera avatar Nov 18 '22 08:11 aumillera

I also ran into this problem: training gets stuck in the dataloader and then raises this error.

liyunlongaaa avatar Nov 30 '22 11:11 liyunlongaaa

I also face this issue @weidezhang @zhangruiskyline @pbelevich @VitalyFedyunin @YuechengLi. Is the only workaround setting num_workers=0?

jaideep11061982 avatar Jan 11 '23 06:01 jaideep11061982

In my case it was due to num_workers being set too high in torch.utils.data.DataLoader().

forever208 avatar Feb 24 '23 22:02 forever208

I got the same issue when setting num_workers > 0 while using mmdetection. It is fine if num_workers is set to 0.

wzhings avatar Jul 13 '23 17:07 wzhings

Same issue. I don't know why, but removing torch.multiprocessing.set_sharing_strategy("file_system") solved the problem. It might be something to do with the platform I'm using.
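
For anyone experimenting with this, a minimal sketch of inspecting and switching the tensor sharing strategy (all three functions live in torch.multiprocessing; which strategy behaves best is platform-dependent):

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'} on Linux
print(mp.get_sharing_strategy())        # the strategy currently in use

# 'file_descriptor' is the Linux default; 'file_system' avoids open-file-descriptor
# limits but can leave shared-memory files behind if workers die uncleanly.
mp.set_sharing_strategy("file_descriptor")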

maciejmajek avatar Sep 20 '23 08:09 maciejmajek

Got the same problem, and it was due to num_workers. The code is not mine so I don't know the details, but setting num_workers=0 fixed the problem.

joaolcguerreiro avatar May 15 '24 21:05 joaolcguerreiro

My goodness, still the same issue with torch 2.4, Python 3.10, hanging on the exact same line

data = self._data_queue.get(timeout=timeout)

when num_workers > 0 (I used 4), either around the first epoch or never, with the default 5 s check interval

MP_STATUS_CHECK_INTERVAL = 5.0

EIFY avatar Aug 16 '24 23:08 EIFY

My problem was solved by following https://github.com/Lightning-AI/pytorch-lightning/issues/18149#issuecomment-1834540962. Anyone who runs into this problem could also try setting multiprocessing_context='spawn' with num_workers > 0.
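
A minimal sketch of that suggestion, passing the spawn context directly to the DataLoader. Note that with spawn the dataset, collate_fn and worker_init_fn must all be picklable, which is exactly what trips up the lambda in the next comment:

import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader():
    dataset = TensorDataset(torch.randn(512, 16))
    # 'spawn' starts each worker as a fresh interpreter instead of forking the
    # parent, so workers do not inherit locks or threads that can deadlock.
    return DataLoader(dataset, batch_size=64, num_workers=4,
                      multiprocessing_context="spawn", persistent_workers=True)

if __name__ == "__main__":  # required with spawn so children can re-import the module safely
    for (batch,) in build_loader():
        pass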

haochengxi avatar Aug 20 '24 17:08 haochengxi

@xijiu9 I got some kind of pickle error when I tried multiprocessing_context='spawn' or 'forkserver':

  (...)
  File "/home/████████████████/Downloads/mup-vit/main.py", line 459, in infinite_loader
    yield from train_loader
  File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 440, in __iter__
    return self._get_iterator()
  File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/████████████████/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1038, in __init__
    w.start()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/usr/local/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main_worker.<locals>.<lambda>'

Not sure what the cause is 😕
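
That AttributeError is the usual spawn/forkserver limitation: workers start as fresh interpreters, so everything handed to the DataLoader (dataset, collate_fn, worker_init_fn, ...) gets pickled, and a lambda defined inside main_worker cannot be. The usual fix is to move the callable to module level, or wrap a module-level function with functools.partial if it needs captured arguments. A sketch of that idea; RandomVectors, scale_batch and the 0.5 factor are made-up stand-ins for whatever the lambda did:

from functools import partial

import torch
from torch.utils.data import DataLoader, Dataset

class RandomVectors(Dataset):
    """Tiny stand-in dataset so the example is self-contained."""
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        return torch.randn(8)

def scale_batch(batch, factor):
    # Module-level function: picklable, unlike a lambda defined inside another function.
    return torch.stack(batch) * factor

def main_worker():
    return DataLoader(
        RandomVectors(),
        batch_size=32,
        num_workers=4,
        multiprocessing_context="spawn",
        # partial of a top-level function pickles fine; a local lambda would not.
        collate_fn=partial(scale_batch, factor=0.5),
    )

if __name__ == "__main__":
    for batch in main_worker():
        print(batch.shape)  # torch.Size([32, 8]); no pickling error with spawn
        break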

EIFY avatar Aug 24 '24 05:08 EIFY