mmdetection
Multi-gpu training gets stuck
Checklist
- [x] I have searched related issues but cannot get the expected help.
- [x] I have read the FAQ documentation but cannot get the expected help.
- [x] The bug has not been fixed in the latest version.
Describe the bug
Single GPU training works fine, single node multi-GPU doesn't.
Relevant: #3823, #2193, #1979, #4535, maybe #3973
Rolling back `intel-openmp` doesn't help, and it uses only default configs.
I.e. running this is ok:
python tools/train.py configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py
Outputs:
...
2021-11-17 17:33:20,582 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,583 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,584 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,586 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,588 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,589 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,598 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2021-11-17 17:33:20,615 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}
2021-11-17 17:33:20,621 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
2021-11-17 17:33:23,497 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
2021-11-17 17:33:23,597 - mmdet - WARNING - The model and loaded state dict do not match exactly
size mismatch for roi_head.bbox_head.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([9, 1024]).
size mismatch for roi_head.bbox_head.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([9]).
size mismatch for roi_head.bbox_head.fc_reg.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([32, 1024]).
size mismatch for roi_head.bbox_head.fc_reg.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([32]).
2021-11-17 17:33:23,600 - mmdet - INFO - Start running, host: vince@wombat, work_dir: /home/vince/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_cityscapes
2021-11-17 17:33:23,600 - mmdet - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
...
--------------------
after_val_epoch:
(VERY_LOW ) TextLoggerHook
--------------------
after_run:
(VERY_LOW ) TextLoggerHook
--------------------
2021-11-17 17:33:23,600 - mmdet - INFO - workflow: [('train', 1)], max: 8 epochs
2021-11-17 17:33:23,600 - mmdet - INFO - Checkpoints will be saved to /home/vince/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_cityscapes by HardDiskBackend.
2021-11-17 17:33:54,886 - mmdet - INFO - Epoch [1][100/23720] lr: 1.988e-03, eta: 16:25:12, time: 0.312, data_time: 0.025, memory: 4024, loss_rpn_cls: 0.0439, loss_rpn_bbox: 0.0921, loss_cls: 0.7297, acc: 80.1914, loss_bbox: 0.3776, loss: 1.2433
2021-11-17 17:34:23,521 - mmdet - INFO - Epoch [1][200/23720] lr: 3.986e-03, eta: 15:44:40, time: 0.286, data_time: 0.004, memory: 4073, loss_rpn_cls: 0.0450, loss_rpn_bbox: 0.1027, loss_cls: 0.3770, acc: 86.8848, loss_bbox: 0.2467, loss: 0.7713
Running this is not ok:
./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2
Outputs:
...
2021-11-17 17:31:12,229 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,230 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,231 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,232 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,233 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,233 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,234 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,236 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,237 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,238 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,239 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,241 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,242 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,246 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,250 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,253 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,270 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2021-11-17 17:31:12,287 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}
2021-11-17 17:31:12,290 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...
Done (t=0.42s)
creating index...
index created!
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
2021-11-17 17:31:13,772 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
and then just waits.
Environment
sys.platform: linux
Python: 3.8.12 (default, Nov 17 2021, 08:17:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.11.1+cu113
OpenCV: 4.5.4-dev
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.18.1+c76ab0e
Seems stuck at loading the checkpoint? Could you interrupt the program and see where it is really stuck?
@RangiLyu after CTRL-C it gives this:
2021-11-21 09:34:37,477 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4006362 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4006363 closing signal SIGINT
^CWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4006362 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4006363 closing signal SIGTERM
^CTraceback (most recent call last):
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
time.sleep(monitor_interval)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4006288 got signal: 2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 716, in run
self._shutdown(e.sigval)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
self._pcontext.close(death_sig)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
self._close(death_sig=death_sig, timeout=timeout)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 720, in _close
handler.proc.wait(time_to_wait)
File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/subprocess.py", line 1083, in wait
return self._wait(timeout=timeout)
File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/subprocess.py", line 1800, in _wait
time.sleep(delay)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4006288 got signal: 2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 721, in run
self._shutdown()
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
self._pcontext.close(death_sig)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
self._close(death_sig=death_sig, timeout=timeout)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 720, in _close
handler.proc.wait(time_to_wait)
File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/subprocess.py", line 1083, in wait
return self._wait(timeout=timeout)
File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/subprocess.py", line 1800, in _wait
time.sleep(delay)
File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4006288 got signal: 2
Does that help?
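Note that the traceback above comes from the launcher process (4006288), not from the two training workers (4006362, 4006363), so it doesn't show where the workers are blocked. One way to get worker-side stacks is Python's standard `faulthandler` module. A minimal sketch, assuming it is called near the top of `tools/train.py`; the helper name `arm_hang_dump` and the 120-second timeout are made up for illustration:

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s=120.0, file=sys.stderr):
    """Dump every thread's stack once if the process is still alive after timeout_s.

    The process keeps running after the dump (exit=False), so a slow but
    healthy start-up is not killed by this.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file, exit=False)
```

Each worker would then print its own Python stack to stderr after the timeout. Calling `faulthandler.cancel_dump_traceback_later()` once training has started disarms it.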
Also note, with 1 training worker it works fine, i.e.:
./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 1
but with 2 it gets stuck:
./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2
I did a bit of digging, and switching to `torchrun` (as the deprecation warning says) makes it work. However, `train.py` needs to be changed slightly. I opened a PR (#6557) with the changes, let me know what you think.
@vakker Do you know why `torch.distributed.launch` blocks?
Actually, the issue was only absent because I didn't use `--launcher pytorch`, so the `distributed` flag was not set at all.
If I set that (either for `torchrun` or `torch.distributed.launch`), then it hangs:
- in `seed = init_random_seed(args.seed)`, on the `dist.broadcast(random_num, src=0)` line
- in `train_detector`, on the `model = MMDistributedDataParallel` line
So there might be multiple issues related to how the distributed training is initialized.
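To tell whether the hang is specific to MMDetection or already happens in a bare `torch.distributed` broadcast, a standalone check could help. This is only a sketch: the file name `check_broadcast.py` and the helper `read_rank_info` are hypothetical, and it assumes the script is started by `torchrun`, which exports `RANK`, `WORLD_SIZE` and `LOCAL_RANK`.

```python
"""Bare broadcast check, independent of MMDetection (illustrative sketch).

Assumed launch command (file name is hypothetical):
    torchrun --nproc_per_node=2 check_broadcast.py nccl
"""
import os
import sys


def read_rank_info(env=os.environ):
    """Return (rank, world_size, local_rank) as exported by the launcher."""
    return (
        int(env.get("RANK", 0)),
        int(env.get("WORLD_SIZE", 1)),
        int(env.get("LOCAL_RANK", 0)),
    )


def main(backend):
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    rank, _, local_rank = read_rank_info()
    use_cuda = backend == "nccl"
    if use_cuda:
        torch.cuda.set_device(local_rank)
    # With no init_method given, PyTorch falls back to env:// initialisation.
    dist.init_process_group(backend=backend)
    device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")
    value = torch.tensor([rank + 100], device=device)
    print(f"[rank {rank}] before broadcast", flush=True)
    dist.broadcast(value, src=0)
    print(f"[rank {rank}] after broadcast: {value.item()}", flush=True)
    dist.destroy_process_group()


# Only run the collective when a launcher has set RANK.
if __name__ == "__main__" and "RANK" in os.environ:
    main(sys.argv[1] if len(sys.argv) > 1 else "nccl")
```

If the "after broadcast" lines never appear with `nccl` but do appear with `gloo`, that would point at the backend rather than at the MMDetection training code.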
@hhaAndroid do you have any input on moving this further?
@RangiLyu @hhaAndroid could you confirm that the issue can be reproduced on your end as well (just run the 2 commands in the issue description)? I'm happy to investigate this further, but it's unclear at this point if this is a broader issue or there's just something wrong with my setup.
@RangiLyu @hhaAndroid do you have any update on this issue?
We support three kinds of distributed launcher in MMCV: https://github.com/open-mmlab/mmcv/blob/86ed509a8bc783cdd0617efeba257375d5aa6658/mmcv/runner/dist_utils.py#L14
If you did not use `--launcher pytorch`, `init_dist` will not be triggered and some env variables will not be set. Would you like to check if this caused the problem?
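A quick way to check the environment part, as a sketch: the four names below are the variables that PyTorch's `env://` initialisation reads. The helper name `missing_dist_env` is made up for illustration, and the snippet is assumed to run inside a worker process (for example, pasted at the top of `tools/train.py`).

```python
import os

# Variables read by torch.distributed's env:// initialisation.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_dist_env(env=None):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_dist_env()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the launcher exported the variables; it does not by itself rule out a hang inside the backend.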
Thanks for the response.
That's how I run it, please see my previous post:
If I set that (either for `torchrun` or `torch.distributed.launch`) then it hangs:
1. in `seed = init_random_seed(args.seed)` on the `dist.broadcast(random_num, src=0)` line
2. in `train_detector` on the `model = MMDistributedDataParallel` line
I.e. the issue is present when the `pytorch` launcher is used.
Could you please at least confirm that you're able to reproduce the issue?
@RangiLyu I just tried again today from the latest master and the issue still persists. Just to reiterate the reproduction steps: This works:
python tools/train.py configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py
This gets stuck:
bash ./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2
Env info:
sys.platform: linux
Python: 3.8.12 (default, Nov 17 2021, 08:17:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.11.1+cu113
OpenCV: 4.5.4-dev
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.18.1+a7a16af
Please let me know if you need any more information. Thanks!
Hi @vakker , It might be caused by limited memory. You may try workers_per_gpu=0 or 1.
@ZwwWayne Thanks for the suggestion, but that's not the issue. Just to be sure, I tried `workers_per_gpu=0` (and also the default 2), and the issue persists, while there's more than 50GB of RAM available on the machine.
If you run the reproduction steps above, do you experience the same or not? That way it would be easy to narrow down the possibilities. Thanks!
I tried but could not reproduce this problem.
@RangiLyu thanks, so just to be sure, you ran `bash ./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2` and it didn't get stuck, right?
Then could you provide your environment info, so I can compare and see what causes this? Thanks
Yes, I ran `bash ./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2`, the same as you provided. The difference is that I used `srun` to start `dist_train.sh` because our cluster uses Slurm, but I still used the pytorch launcher.
And here is my environment:
sys.platform: linux
Python: 3.6.13 | packaged by conda-forge | (default, Sep 23 2021, 07:56:31) [GCC 9.4.0]
CUDA available: True
GPU 0,1: TITAN Xp
CUDA_HOME: /mnt/lustre/share/polaris/dep/cuda-9.0-cudnn7.6.5
NVCC: Cuda compilation tools, release 9.0, V9.0.176
GCC: gcc (GCC) 5.4.0
PyTorch: 1.8.1+cuda90.cudnn7.6.5
PyTorch compiling details: PyTorch built with:
- GCC 5.4
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201307 (a.k.a. OpenMP 4.0)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 9.0
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70
- CuDNN 7.6.5
- Magma 2.5.0
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=9.0, CUDNN_VERSION=7.6.5, CXX_COMPILER=/usr/local/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.8.0a0+2f40a48
OpenCV: 4.5.4
MMCV: 1.4.0
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 9.0
MMDetection: 2.20.0+ff9bc39
@RangiLyu thanks for the details. I'll try to match your setup in terms of library versions, as you're using different Python, Pytorch, Torchvision, Cuda, CuDNN versions, and see if I can reproduce the issue.
@RangiLyu @hhaAndroid I was able to reproduce this issue with Docker, I tried a broad range of settings:
Pytorch | Cuda | CuDNN | Works |
---|---|---|---|
1.6.0 | 10.1 | 7 | :heavy_check_mark: |
1.7.0 | 11.0 | 8 | :x: |
1.8.0 | 11.1 | 8 | :x: |
1.9.0 | 10.2 | 7 | :x: |
1.9.0 | 11.1 | 8 | :x: |
1.10.0 | 11.3 | 8 | :x: |
Dockerfile
Based on the official Dockerfile:
ARG PYTORCH="1.6.0"
ARG CUDA="10.1"
ARG CUDNN="7"
FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
ARG MMCV="1.4.6"
ARG PYTORCH
ARG CUDA
ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
RUN apt-get update && apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Install MMCV
RUN pip install --no-cache-dir --upgrade pip wheel setuptools
RUN ["/bin/bash", "-c", "pip install mmcv-full==${MMCV} -f https://download.openmmlab.com/mmcv/dist/cu${CUDA//./}/torch${PYTORCH}/index.html"]
# Install MMDetection
RUN conda clean --all
RUN git clone https://github.com/open-mmlab/mmdetection.git /mmdetection
WORKDIR /mmdetection
ENV FORCE_CUDA="1"
RUN pip install --no-cache-dir -r requirements/build.txt
RUN pip install --no-cache-dir -e .
Build:
docker build \
--build-arg PYTORCH=<pytorch> \
--build-arg CUDA=<cuda> \
--build-arg CUDNN=<cudnn> \
-t mmdet:<pytorch>-<cuda>-<cudnn> .
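To rebuild the whole matrix from the table in one go, the build commands could be generated with a small script. A sketch only: the version triples are copied from the table above, while the tag scheme `mmdet:<pytorch>-<cuda>-<cudnn>` and the helper name `build_command` are assumptions for illustration.

```python
import shlex

# (PyTorch, CUDA, cuDNN) combinations from the results table.
MATRIX = [
    ("1.6.0", "10.1", "7"),
    ("1.7.0", "11.0", "8"),
    ("1.8.0", "11.1", "8"),
    ("1.9.0", "10.2", "7"),
    ("1.9.0", "11.1", "8"),
    ("1.10.0", "11.3", "8"),
]


def build_command(pytorch, cuda, cudnn, context="."):
    """Return the docker build argv for one version combination."""
    tag = f"mmdet:{pytorch}-{cuda}-{cudnn}"  # tag scheme is an assumption
    return [
        "docker", "build",
        "--build-arg", f"PYTORCH={pytorch}",
        "--build-arg", f"CUDA={cuda}",
        "--build-arg", f"CUDNN={cudnn}",
        "-t", tag, context,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for combo in MATRIX:
        print(shlex.join(build_command(*combo)))
```

The script only prints the commands; piping its output to a shell, or passing each list to `subprocess.run`, would execute them.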
After some digging, I found this discussion, so I tried using the `gloo` backend instead of `nccl`, i.e. in `configs/_base_/default_runtime.py` I changed `dist_params = dict(backend='nccl')` to `dist_params = dict(backend='gloo')`.
That does make it work on all the test scenarios:
Pytorch | Cuda | CuDNN | Works |
---|---|---|---|
1.6.0 | 10.1 | 7 | :heavy_check_mark: |
1.7.0 | 11.0 | 8 | :heavy_check_mark: |
1.8.0 | 11.1 | 8 | :heavy_check_mark: |
1.9.0 | 10.2 | 7 | :heavy_check_mark: |
1.9.0 | 11.1 | 8 | :heavy_check_mark: |
1.10.0 | 11.3 | 8 | :heavy_check_mark: |
Please note that the CUDA version you used (9.0) when trying to reproduce this issue is really old: it was released in 2017, and I couldn't even find a PyTorch Docker image for it. It might make sense to test the code using versions that are more commonly used recently (10.2, 11.3). I'm not sure PyTorch even supports anything older than 10.2, see here.
Please, verify if this is an actual bug that you can reproduce.
Hi @vakker @RangiLyu @ZwwWayne
I experienced the same problem. What is the recommended fix here?
Using `torchrun`, or downgrading the PyTorch version?
@DangChuong-DC I'm not sure what the right solution is, the MMLab team should reproduce this and investigate further.
It seems that changing the dist backend to `gloo` instead of `nccl` works, but it might have performance implications.
Maybe rewriting the `dist_train.sh` script to work with `torchrun` is the way to go, but that'll require changes in `train.py` too.
I experienced a similar problem. It works for me. The outputs are as follows.
2022-04-01 12:15:36,770 - mmseg - INFO - Set random seed to 0, deterministic: False
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is a deprecated, '
2022-04-01 12:15:37,557 - mmseg - INFO - initialize ResNetV1c with init_cfg {'type': 'Pretrained', 'checkpoint': 'open-mmlab://resnet101_v1c'}
2022-04-01 12:15:37,558 - mmcv - INFO - load model from: open-mmlab://resnet101_v1c
2022-04-01 12:15:37,558 - mmcv - INFO - load checkpoint from openmmlab path: open-mmlab://resnet101_v1c
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800337 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800337 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800795 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800809 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800795 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801043 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800809 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801043 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2000 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2001 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2002 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2004 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2005 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2006 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 2003) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
tools/train.py FAILED
-----------------------------------------------------
Failures:
[1]:
time : 2022-04-01_12:45:45
host : d7537e2e1710
rank : 7 (local_rank: 7)
exitcode : -6 (pid: 2007)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2007
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-04-01_12:45:45
host : d7537e2e1710
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 2003)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2003
=====================================================
I also changed dist_params = dict(backend='nccl') to dist_params = dict(backend='gloo').
It works for me. Thanks!
My solution is below. Add the following commands to ~/.bashrc:
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"
Then run source ~/.bashrc. It works for me. No need to change nccl -> gloo.
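If you would rather not change ~/.bashrc for every shell, the same two variables can be set only for the training process. This is a minimal sketch, not part of mmdetection: the helper name nccl_workaround_env is made up for illustration, and it assumes the launcher is started from Python so the environment can be passed along.

```python
import os


def nccl_workaround_env(base_env=None):
    """Return a copy of the environment with the two NCCL workarounds set.

    NCCL_P2P_DISABLE=1 turns off direct GPU-to-GPU (peer-to-peer) transport.
    NCCL_IB_DISABLE=1 turns off the InfiniBand transport.
    """
    # Copy so the caller's environment (or os.environ) is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"
    env["NCCL_IB_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_workaround_env()
    # Print only the NCCL-related variables so the result is easy to check.
    for name in sorted(k for k in env if k.startswith("NCCL_")):
        print(f"{name}={env[name]}")
```

The returned dictionary can be handed to subprocess.run(..., env=env) when starting the training script. Disabling peer-to-peer transport can reduce throughput, because GPUs then exchange data through host memory, so it is worth comparing iteration times before and after.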
Thank you, but none of the above works for me :( After days of searching and trying, my workaround is to set the system runlevel to 3:
sudo init 3
The GPUs are happy for now and I will report if anything goes wrong again.
Ref: https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x
My solution is below. Add the following commands to ~/.bashrc:
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"
Then run source ~/.bashrc. It works for me. No need to change nccl -> gloo.
This answer works too!! Glad no need to change from nccl to gloo.
Thanks!
My solution is below. Add the following commands to ~/.bashrc:
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"
Then run source ~/.bashrc. It works for me. No need to change nccl -> gloo.
Will this slow down training?
I have been completely unable to get distributed training working with the Docker container, even after following all of the suggestions in this thread. Has anybody else had better luck resolving this issue? I'm at my wits' end trying to get this to work.
I was trying to run distributed training on a Vertex AI Workbench notebook. After 10000000 tries, I finally got it working this way:
- Get a machine with PyTorch 2.2 installed (CUDA 12.1) and as many GPUs as you want (4 in my case).
- DO NOT CREATE A CONDA ENV; install the following in the base env:
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -v -e .
- In mmdetection/mmdet/__init__.py, change
mmcv_maximum_version = '2.2.0'
to
mmcv_maximum_version = '2.2.1'
to avoid any version mismatch.
- Make sure that LD_LIBRARY_PATH is
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu/
and NOT
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu/:/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu/
(basically no duplicates or other variants; this happened when I was getting a machine with CUDA only, no PyTorch).
- Run
export NCCL_P2P_DISABLE='1'
and
export NCCL_IB_DISABLE='1'
in the base env.
- Run
bash mmdetection/tools/dist_train.sh <CONFIG_PATH> 4
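The LD_LIBRARY_PATH check in the steps above can be automated. The sketch below is an illustration only: the function name dedupe_path is made up, and it assumes that entries differing only by a trailing slash refer to the same directory. It keeps the first occurrence of each entry and preserves order.

```python
import os


def dedupe_path(value, sep=":"):
    """Drop empty and repeated entries, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for entry in value.split(sep):
        # Treat "/usr/lib/" and "/usr/lib" as the same directory.
        key = entry.rstrip("/") or entry
        if not entry or key in seen:
            continue
        seen.add(key)
        kept.append(entry)
    return sep.join(kept)


if __name__ == "__main__":
    # Print a cleaned copy of the current value; nothing is modified in place.
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print(dedupe_path(current))
```

Run as a script, it prints a cleaned value that can be exported in the shell before launching dist_train.sh.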
Thanks to everyone else for your helpful guidance!
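On the mmcv_maximum_version edit in the steps above: the change from '2.2.0' to '2.2.1' matters if the check in mmdet/__init__.py treats the maximum as an exclusive upper bound. That is an assumption here, since the check itself is not quoted in this thread. Under that assumption, the effect can be sketched as follows; the function names are made up, the minimum '2.0.0' is taken from the install command above, and the parser handles only plain numeric versions.

```python
def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def mmcv_version_allowed(installed, minimum, maximum):
    """Assumed rule: minimum <= installed < maximum (maximum is exclusive)."""
    low = parse_version(minimum)
    high = parse_version(maximum)
    return low <= parse_version(installed) < high


if __name__ == "__main__":
    # With the original bound, an installed mmcv 2.2.0 would be rejected.
    print(mmcv_version_allowed("2.2.0", "2.0.0", "2.2.0"))
    # With the bound raised to 2.2.1, the same install would be accepted.
    print(mmcv_version_allowed("2.2.0", "2.0.0", "2.2.1"))
```

Raising the bound only silences the check; it does not guarantee that the newer mmcv is fully compatible.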