mmdetection Dist training failed due to FileNotFoundError

Hi @BIGWangYuDong @ZwwWayne, thanks for your help, I appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

I am training deformable-detr on two machines. Training itself runs smoothly, but evaluation fails, so the run crashes after a single epoch and I have to re-train the model.

Reproduction

What command or script did you run?

For node 1

NNODES=2 NODE_RANK=0 PORT=23456 MASTER_ADDR=xxxx bash tools/dist_train.sh configs/coco/deformable_detr_r50_16x2_50e_coco.py 8

For node 2

NNODES=2 NODE_RANK=1 PORT=23456 MASTER_ADDR=xxxx bash tools/dist_train.sh configs/coco/deformable_detr_r50_16x2_50e_coco.py 8

Did you make any modifications on the code or config? No.
Did you understand what you have modified? Yes.
What dataset did you use? COCO2017

Error traceback

Traceback (most recent call last):
  File "tools/train.py", line 242, in <module>
    main()
  File "tools/train.py", line 238, in main
    meta=meta)
  File "/home/tiger/.local/lib/python3.7/site-packages/mmdet/apis/train.py", line 244, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/tiger/.local/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/tiger/.local/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
    self.call_hook('after_train_epoch')
  File "/home/tiger/.local/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/tiger/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 271, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/tiger/.local/lib/python3.7/site-packages/mmdet/core/evaluation/eval_hooks.py", line 130, in _do_evaluate
    gpu_collect=self.gpu_collect)
  File "/home/tiger/.local/lib/python3.7/site-packages/mmdet/apis/test.py", line 132, in multi_gpu_test
    results = collect_results_cpu(results, len(dataset), tmpdir)
  File "/home/tiger/.local/lib/python3.7/site-packages/mmdet/apis/test.py", line 167, in collect_results_cpu
    part_list.append(mmcv.load(part_file))
  File "/home/tiger/.local/lib/python3.7/site-packages/mmcv/fileio/io.py", line 67, in load
    with BytesIO(file_client.get(file)) as f:
  File "/home/tiger/.local/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 1014, in get
    return self.client.get(filepath)
  File "/home/tiger/.local/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 535, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/tiger/code/mmdet/work_dirs/deformable_detr_r50_16x2_50e_coco/.eval_hook/part_8.pkl'

Environment

sys.platform: linux
Python: 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: A100-SXM-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1+cu113
OpenCV: 4.6.0
MMCV: 1.6.1
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.25.1+

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Sep 16 '22 03:09 liming-ai

Hi @ZwwWayne, could you please help me, since there has been no reply from @BIGWangYuDong?

Sep 19 '22 13:09 liming-ai

Set evaluation.gpu_collect=True in the dataset config, as recommended in a related issue. Setting work_dir to a folder shared between the nodes is another option.
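
For context, the traceback shows collect_results_cpu trying to load part_8.pkl from the .eval_hook folder under work_dir. With 8 GPUs per node, rank 8 would be running on the second node, so if work_dir sits on a node-local disk, that file likely exists only on the other machine. A minimal sketch of the config change, assuming the evaluation dict is forwarded to the eval hook as keyword arguments (the interval and metric values here are assumptions, keep whatever your config already uses, and the shared path is hypothetical):

```python
# Fragment for the config used above (a sketch, not the exact file contents).
# interval and metric are assumed values; keep the ones your config already has.
evaluation = dict(
    interval=1,
    metric='bbox',
    # Gather results across ranks via GPU communication instead of
    # writing and reading .eval_hook/part_*.pkl files under work_dir.
    gpu_collect=True,
)

# Alternative: keep CPU collection, but place work_dir on storage that is
# mounted at the same path on both nodes. The path below is hypothetical.
# work_dir = '/mnt/shared/work_dirs/deformable_detr_r50_16x2_50e_coco'
```

The same override can probably also be passed on the command line with --cfg-options evaluation.gpu_collect=True, if your tools/train.py supports that flag.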

Sep 23 '22 03:09 WenkaiYe

Hi, sorry for the late reply, have you fixed this issue?

Sep 30 '22 02:09 BIGWangYuDong
