Training models with the SynthText dataset triggers the OOM killer.
Thanks for your error report and we appreciate it a lot.
**Checklist**
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
**Describe the bug**
When I try to use the SynthText dataset to train models such as dbnet_r50dcnv2_fpnc, the process is eventually killed by the OOM killer (out of memory). The buff/cache value reported by the `top` command on Linux increases slowly as training continues, until memory runs out. I think there is a memory leak in the data preprocessing code, or maybe in the TextDataset, but I don't know how to detect and locate it. Please fix this problem.
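For anyone trying to pin this down, here is a minimal sketch (assuming `psutil` is installed; the function name and logging interval are made up for illustration) that logs the RSS of the training process and its dataloader workers, so the growth can be attributed to a specific process:

```python
import os
import psutil


def log_memory(iter_idx, every=100):
    """Print RSS of the training process and its children (dataloader workers)."""
    if iter_idx % every != 0:
        return
    proc = psutil.Process(os.getpid())
    main_rss = proc.memory_info().rss
    worker_rss = 0
    workers = proc.children(recursive=True)
    for w in workers:
        try:
            worker_rss += w.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # worker exited between listing and querying
    print(f'[iter {iter_idx}] main RSS: {main_rss / 2**20:.1f} MiB, '
          f'{len(workers)} workers RSS: {worker_rss / 2**20:.1f} MiB')
```

Calling this from the training loop (or a custom hook) every few hundred iterations should show whether the main process or the workers are the ones growing.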
**Reproduction**
Just run the config file `` and observe the memory usage; you can see it.
1. What command or script did you run? I ran the config file: configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_100k_iters_synthtext.py
2. Did you make any modifications on the code or config? Did you understand what you have modified?
3. What dataset did you use?
**Environment**
1. Please run `python mmocr/utils/collect_env.py` to collect necessary environment information and paste it here.
2. You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch \[e.g., pip, conda, source\]
- Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)
sys.platform: linux
Python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: None
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.12.0
OpenCV: 4.6.0
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMOCR: 0.6.0+688d72f
**Error traceback**
No error traceback. The process is simply killed; the message on screen shows the memory usage of the process, and the anon memory (namely, the buff/cache memory) is too large to be allocated.
**Bug fix**
Hi,
This might be a memory leak issue caused by the PyTorch dataloader. A possible solution is to replace the default multi-process start method with `mp_start_method = 'spawn'` (a sketch follows):
https://github.com/open-mmlab/mmocr/blob/1755dad1935c39e82b81d70d7aaf0cab4eb0db62/configs/base/default_runtime.py#L17
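A minimal sketch of the change, assuming your config inherits from the linked default_runtime.py (the surrounding keys may differ in your version):

```python
# configs/_base_/default_runtime.py, or override it in your own config.
# Use 'spawn' instead of the default 'fork' so each dataloader worker starts
# with a fresh interpreter rather than inheriting the parent's memory pages.
mp_start_method = 'spawn'
```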
Besides, the current val/test dataloaders copy the training dataset; you may also change them to a smaller dataset that you have (or just a toy dataset), as in the sketch below. https://github.com/open-mmlab/mmocr/blob/1755dad1935c39e82b81d70d7aaf0cab4eb0db62/configs/base/det_datasets/synthtext.py#L18
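For example, a sketch of pointing val/test at a smaller dataset instead of the SynthText copy (field names follow the usual mmocr 0.x IcdarDataset base configs and may need adjusting to your local layout; `test_pipeline` is the pipeline already defined in the dbnet config):

```python
# Override only the val/test entries in your training config.
small_root = 'data/icdar2015'  # any small dataset you have locally
small_test = dict(
    type='IcdarDataset',
    ann_file=f'{small_root}/instances_test.json',
    img_prefix=f'{small_root}/imgs',
    pipeline=None)

data = dict(
    val=dict(
        type='UniformConcatDataset',
        datasets=[small_test],
        pipeline=test_pipeline),
    test=dict(
        type='UniformConcatDataset',
        datasets=[small_test],
        pipeline=test_pipeline))
```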
You may also want to check this issue: https://github.com/open-mmlab/mmdetection/issues/7786.
For more details on the causes of the problem, you might be interested in reading this discussion: https://forums.fast.ai/t/runtimeerror-dataloader-worker-is-killed-by-signal/31277
I have read everything you provided. I tried setting `mp_start_method='spawn'` and `num_workers=0`, but it doesn't work, and there is no evaluation in my config file. To give you more details: the memory that keeps increasing is anon memory, which stores data on the heap/stack, not file caches. The increase is very stable, dozens of KB per iteration, and I have to run 100K iterations, so it ends up in OOM. Is there anything else that could help?
Hi, sorry for the late reply. We trained the SynthText pre-trained model on a cluster equipped with very large RAM, so we did not notice this possible memory-leak issue. Thank you for reporting the problem and providing useful information. We are still looking into it and will let you know as soon as there are any updates. Thank you for your patience.
Sorry, maybe I was wrong. `buff/cache` is the page cache for files, while `AnonPages` is likely the heap/stack data corresponding to the `used` field in the output of `free -h`, and `used` stays at about 14 GB. So instead of a memory leak, it may be a problem with the dataloader. I'm confused; I will post updates as I find out more.
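To separate the two, here is a small sketch (plain Python, Linux only, no extra dependencies) that reads the relevant `/proc/meminfo` fields directly:

```python
def read_meminfo(fields=('MemFree', 'Buffers', 'Cached', 'AnonPages')):
    """Return selected /proc/meminfo values in MiB."""
    values = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            if key in fields:
                values[key] = int(rest.strip().split()[0]) / 1024  # kB -> MiB
    return values


print(read_meminfo())
# Buffers + Cached make up the reclaimable page cache (buff/cache);
# AnonPages is heap/stack memory that counts towards `used` in `free -h`
# and cannot be reclaimed by the kernel.
```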
/data/ds/mmocr/mmocr/datasets/pipelines/textdet_targets/base_textdet_targets.py:48: RuntimeWarning: invalid value encountered in sqrt result = np.sqrt(a_square * b_square * square_sin /
I think the cause is a bad file in the dataset. I caught the moment when the training process occupied almost all memory: free memory decreased dramatically, training almost stopped, and nearly all memory was held by the training process for several minutes. After that, training continued and free memory returned to its normal level. The `result = np.sqrt` line in `base_textdet_targets` is the cause; maybe there is some bad image or annotation in the dataset. You should check the code on that line.
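For illustration, a hypothetical guard around that kind of expression (this mirrors the shape of the warned line but is not the upstream code; `safe_sqrt_distance` and its arguments are made-up names):

```python
import numpy as np


def safe_sqrt_distance(a_square, b_square, square_sin, c_square):
    """Clamp the operand so degenerate polygons cannot feed np.sqrt a
    negative value, which would trigger the RuntimeWarning and yield NaNs."""
    eps = np.finfo(np.float32).eps
    square_sin = np.clip(square_sin, 0, 1)  # guard against tiny negative values
    result = np.sqrt(a_square * b_square * square_sin / (c_square + eps))
    return np.nan_to_num(result)
```

Logging the index of the offending annotation whenever the warning fires would also help confirm whether a specific bad image or polygon in the dataset is the trigger.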