
DataLoader worker is killed by signal. It may be caused by the np.sqrt function.

manjaro-git opened this issue · 1 comment

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

/data/ds/mmocr/mmocr/datasets/pipelines/textdet_targets/base_textdet_targets.py:48: RuntimeWarning: invalid value encountered in sqrt
  result = np.sqrt(a_square * b_square * square_sin /

Traceback (most recent call last):
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/multiprocessing/connection.py", line 262, in poll
    return self._poll(timeout)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
    r = wait([self], timeout)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/multiprocessing/connection.py", line 936, in wait
    ready = selector.select(timeout)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2019) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/ds/mmocr/tools/train.py", line 229, in <module>
    main()
  File "/data/ds/mmocr/tools/train.py", line 218, in main
    train_detector(
  File "/data/ds/mmocr/mmocr/apis/train.py", line 155, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
    data_batch = next(data_loader)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/home/ds/anaconda3/envs/dlenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1024, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 2019) exited unexpectedly
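For reference, the RuntimeWarning at the top means the argument passed to np.sqrt went negative (or NaN) for some sample, so the generated target contains NaNs. A minimal guard sketch, not the actual MMOCR code (safe_sqrt and denominator below are only illustrative):

```python
import numpy as np

def safe_sqrt(x):
    """Clip small negative values (floating-point error) to zero before sqrt.

    This only suppresses the RuntimeWarning; if the argument is genuinely
    negative or NaN, the underlying polygon/label data should be inspected.
    """
    return np.sqrt(np.clip(x, 0, None))

# Hypothetical use at base_textdet_targets.py:48, wrapping the existing
# expression instead of calling np.sqrt on it directly:
# result = safe_sqrt(a_square * b_square * square_sin / denominator)
```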

Reproduction

  1. What command or script did you run?

I used a config file that replaces the R-50 backbone in dbnet_r50dcnv2_fpnc_10k_synthtext.py with a Swin Transformer (a rough sketch of such a swap is given after this list).

  2. Did you make any modifications to the code or config? Do you understand what you modified?
  3. What dataset did you use?
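As a rough sketch of the backbone swap mentioned in point 1 (the `mmdet.SwinTransformer` type, the parameter values, and the FPNC channel numbers below are assumptions based on the usual MMOCR 0.x / MMDetection config layout, not the config that was actually run):

```python
# Hypothetical override on top of dbnet_r50dcnv2_fpnc_10k_synthtext.py.
_base_ = ['./dbnet_r50dcnv2_fpnc_10k_synthtext.py']

model = dict(
    backbone=dict(
        _delete_=True,                 # drop the inherited ResNet-50 settings
        type='mmdet.SwinTransformer',  # reuse the Swin backbone from MMDetection
        embed_dims=96,
        depths=(2, 2, 6, 2),
        num_heads=(3, 6, 12, 24),
        window_size=7,
        out_indices=(0, 1, 2, 3)),
    neck=dict(in_channels=[96, 192, 384, 768]))  # Swin-T stage output channels
```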

Environment

sys.platform: linux
Python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: None
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.12.0
OpenCV: 4.6.0
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMOCR: 0.6.0+688d72f

Error traceback

If applicable, paste the error traceback here.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

manjaro-git · Jul 19 '22 06:07

I find that when the invalid value in sqrt is encountered, one of my four GPUs stalls (its usage drops to zero) while the other three stay at 100%. This lasts for several seconds to a minute, and a lot of RAM is occupied (my machine has 64 GB, with only 17 GB left available). Eventually the process either crashes or continues as if nothing had happened; if it continues, available memory goes back up to 47 GB. I believe this is what makes the dataloader crash. I reported a related error earlier (https://github.com/open-mmlab/mmocr/issues/1165); do you think the two are related?
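If the worker is being killed by the OOM killer (which would match the RAM usage described above), one thing I could try is reducing the dataloader's memory footprint. A minimal sketch assuming the standard mmcv-style data config; the numbers are placeholders, not the values I actually use:

```python
# Hypothetical override in the training config; field names follow the usual
# MMCV / MMOCR 0.x layout.  Smaller values reduce the RAM held by the dataloader.
data = dict(
    samples_per_gpu=4,   # placeholder: smaller batch per GPU
    workers_per_gpu=2)   # placeholder: fewer worker processes
```

Checking dmesg right after a crash for an out-of-memory kill message would also confirm whether the OOM killer is responsible.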

manjaro-git · Jul 19 '22 08:07