
Multi-gpu training gets stuck

Open vakker opened this issue 3 years ago • 27 comments

Checklist

  1. [x] I have searched related issues but cannot get the expected help.
  2. [x] I have read the FAQ documentation but cannot get the expected help.
  3. [x] The bug has not been fixed in the latest version.

Describe the bug

Single-GPU training works fine, but single-node multi-GPU training does not. Possibly relevant: #3823, #2193, #1979, #4535, and maybe #3973. Rolling back intel-openmp does not help, and only the default configs are used.

That is, running this works:

python tools/train.py configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py

Outputs:

...
2021-11-17 17:33:20,582 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,583 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,584 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,586 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,588 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,589 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,598 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2021-11-17 17:33:20,615 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}                                                                                       
2021-11-17 17:33:20,621 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...
Done (t=0.44s)                                                                                                                                                                                                     
creating index...                                                                                        
index created!                                                                                                                                                                                                     
loading annotations into memory...                
Done (t=0.06s)                                                                                                                                                                                                     
creating index...        
index created!                                                                                                                                                                                                     
2021-11-17 17:33:23,497 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
2021-11-17 17:33:23,597 - mmdet - WARNING - The model and loaded state dict do not match exactly                                                                                                                   
                                                                                                         
size mismatch for roi_head.bbox_head.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([9, 1024]).                                        
size mismatch for roi_head.bbox_head.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([9]).                                                      
size mismatch for roi_head.bbox_head.fc_reg.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([32, 1024]).   
size mismatch for roi_head.bbox_head.fc_reg.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([32]).                                                    
2021-11-17 17:33:23,600 - mmdet - INFO - Start running, host: vince@wombat, work_dir: /home/vince/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_cityscapes
2021-11-17 17:33:23,600 - mmdet - INFO - Hooks will be executed in the following order:                                                                                                                            
before_run:                                                                                              
(VERY_HIGH   ) StepLrUpdaterHook                                                                         
(NORMAL      ) CheckpointHook                     
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                          
...     
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
2021-11-17 17:33:23,600 - mmdet - INFO - workflow: [('train', 1)], max: 8 epochs
2021-11-17 17:33:23,600 - mmdet - INFO - Checkpoints will be saved to /home/vince/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_cityscapes by HardDiskBackend.
2021-11-17 17:33:54,886 - mmdet - INFO - Epoch [1][100/23720]   lr: 1.988e-03, eta: 16:25:12, time: 0.312, data_time: 0.025, memory: 4024, loss_rpn_cls: 0.0439, loss_rpn_bbox: 0.0921, loss_cls: 0.7297, acc: 80.1914, loss_bbox: 0.3776, loss: 1.2433
2021-11-17 17:34:23,521 - mmdet - INFO - Epoch [1][200/23720]   lr: 3.986e-03, eta: 15:44:40, time: 0.286, data_time: 0.004, memory: 4073, loss_rpn_cls: 0.0450, loss_rpn_bbox: 0.1027, loss_cls: 0.3770, acc: 86.8848, loss_bbox: 0.2467, loss: 0.7713

Running this does not work:

./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2

Outputs:

...
2021-11-17 17:31:12,229 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,230 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,231 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,232 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,233 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,233 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,234 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,236 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,237 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,238 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,239 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,241 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,242 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,246 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,250 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,253 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,270 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}                                                                             
2021-11-17 17:31:12,287 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}                                                                                       
2021-11-17 17:31:12,290 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...                                                                                                                                                                                 
Done (t=0.42s)                                                                                                                                                                                                     
creating index...                                                                                                                                                                                                  
index created!                                                                                                                                                                                                     
loading annotations into memory...                                                                                                                                                                                 
Done (t=0.06s)                                                                                                                                                                                                     
creating index...                                                                                                                                                                                                  
index created!                                                                                                                                                                                                     
2021-11-17 17:31:13,772 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth

and then just waits.

Environment

sys.platform: linux
Python: 3.8.12 (default, Nov 17 2021, 08:17:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.11.1+cu113
OpenCV: 4.5.4-dev
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.18.1+c76ab0e

vakker commented on Nov 17 '21

It seems to be stuck at loading the checkpoint? Could you interrupt the program and see where it is really stuck?

RangiLyu commented on Nov 18 '21

@RangiLyu after CTRL-C it gives this:

2021-11-21 09:34:37,477 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers                                                                                                                
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4006362 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4006363 closing signal SIGINT                                                                                                                
^CWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4006362 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4006363 closing signal SIGTERM                                                                                                               
^CTraceback (most recent call last):
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run     
    result = self._invoke_run(role)                                                                      
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
    time.sleep(monitor_interval)   
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4006288 got signal: 2
                                                    
During handling of the above exception, another exception occurred:                                                                                                                                                
                                                                                                         
Traceback (most recent call last):                                                                       
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 716, in run
    self._shutdown(e.sigval)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
    self._pcontext.close(death_sig)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 720, in _close
    handler.proc.wait(time_to_wait)
  File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/subprocess.py", line 1083, in wait
    return self._wait(timeout=timeout)
  File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/subprocess.py", line 1800, in _wait
    time.sleep(delay)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4006288 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 721, in run
    self._shutdown()
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
    self._pcontext.close(death_sig)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 720, in _close
    handler.proc.wait(time_to_wait)
  File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/subprocess.py", line 1083, in wait
    return self._wait(timeout=timeout)
  File "/home/vince/.pyenv/versions/3.8.12/lib/python3.8/subprocess.py", line 1800, in _wait
    time.sleep(delay)
  File "/home/vince/.pyenv/versions/cm2-p1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4006288 got signal: 2

Does that help?

Also note that it works fine with a single training process, i.e.:

./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 1

but with 2 it gets stuck:

./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2

vakker commented on Nov 21 '21

I did a bit of digging and switching to torchrun (as the deprecation warning says) makes it work. However, train.py needs to be changed slightly. I opened a PR (#6557) with the changes, let me know what you think.
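For context, the kind of change usually needed when moving from torch.distributed.launch to torchrun is to take the local rank from the LOCAL_RANK environment variable that torchrun exports, instead of relying on the --local_rank argument. A rough sketch (my illustration only, not necessarily what #6557 contains):

# Sketch only: torchrun exports LOCAL_RANK, while torch.distributed.launch
# passes --local_rank on the command line.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)))
args = parser.parse_args()
# Keep the environment variable in sync so downstream code that reads
# LOCAL_RANK works with either launcher.
os.environ.setdefault('LOCAL_RANK', str(args.local_rank))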

vakker commented on Nov 21 '21

@vakker Do you know why torch.distributed.launch blocks?

hhaAndroid commented on Nov 22 '21

Actually, the issue did not show up there because I didn't use --launcher pytorch, so the distributed flag was not set at all.

If I set that (either for torchrun or torch.distributed.launch) then it hangs:

  1. in `seed = init_random_seed(args.seed)` on the `dist.broadcast(random_num, src=0)` line
  2. in `train_detector` on the `model = MMDistributedDataParallel` line

So there might be multiple issues related to how distributed training is initialized.
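To isolate whether the hang is inside MMDetection or in NCCL itself, a standalone broadcast test can help. This is my own minimal sketch (nccl_check.py is a hypothetical file name, not part of the repo); it assumes a launcher such as torchrun that sets MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK, with one process per GPU:

# nccl_check.py -- minimal sanity check, launched with e.g.
#   torchrun --nproc_per_node=2 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    # Same scheme as the pytorch launcher path: env:// rendezvous + NCCL.
    dist.init_process_group(backend="nccl")
    x = torch.tensor([float(dist.get_rank())], device="cuda")
    # If this call hangs, the problem is in the NCCL/GPU setup (e.g. P2P
    # between the two cards), not in MMDetection itself.
    dist.broadcast(x, src=0)
    print(f"rank {dist.get_rank()}: got {x.item()} from rank 0")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()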

vakker commented on Nov 22 '21

@hhaAndroid do you have any input on moving this further?

vakker commented on Dec 02 '21

@RangiLyu @hhaAndroid could you confirm that the issue can be reproduced on your end as well (just run the 2 commands in the issue description)? I'm happy to investigate this further, but it's unclear at this point if this is a broader issue or there's just something wrong with my setup.

vakker commented on Dec 09 '21

@RangiLyu @hhaAndroid do you have any update on this issue?

vakker commented on Jan 12 '22

We support three kinds of distributed launchers in MMCV: https://github.com/open-mmlab/mmcv/blob/86ed509a8bc783cdd0617efeba257375d5aa6658/mmcv/runner/dist_utils.py#L14

If you did not use --launcher pytorch, init_dist will not be triggered and some environment variables will not be set. Could you check whether this caused the problem?

RangiLyu commented on Jan 17 '22

Thanks for the response.

That's how I run it, please see my previous post:

If I set that (either for torchrun or torch.distributed.launch) then it hangs:

1. in `seed = init_random_seed(args.seed)` on the `dist.broadcast(random_num, src=0)` line

2. in `train_detector` on the `model = MMDistributedDataParallel` line

I.e. the issue is present when the pytorch launcher is used.

Could you please confirm at least that you're able to reproduce the issue?

vakker commented on Jan 17 '22

@RangiLyu I just tried again today with the latest master and the issue still persists. To reiterate the reproduction steps, this works:

python tools/train.py configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py

This gets stuck:

bash ./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2
Env info:
sys.platform: linux
Python: 3.8.12 (default, Nov 17 2021, 08:17:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.11.1+cu113
OpenCV: 4.5.4-dev
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.18.1+a7a16af

Please let me know if you need any more information. Thanks!

vakker commented on Jan 19 '22

Hi @vakker, it might be caused by limited memory. You could try workers_per_gpu=0 or 1.
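For reference, that setting lives in the data section of the config; a sketch of the override (the exact values in the Cityscapes base config may differ):

# Hypothetical override of the dataloader settings in the dataset config.
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=0,  # 0 keeps data loading in the main process; try 0 or 1
)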

ZwwWayne commented on Jan 26 '22

@ZwwWayne Thanks for the suggestion, but that's not the issue. Just to be sure, I tried workers_per_gpu=0 (and also the default 2), and the issue persists, while there's more than 50 GB of RAM available on the machine.

If you run the reproduction steps above, do you experience the same issue or not? That would make it easier to narrow down the possibilities. Thanks!

vakker commented on Jan 26 '22

I tried but could not reproduce this problem.

RangiLyu commented on Jan 28 '22

@RangiLyu thanks, so just to be sure, you ran bash ./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2 and it didn't get stuck, right? Then could you provide your environment info so I can compare and see what causes this? Thanks

vakker commented on Jan 28 '22

@RangiLyu thanks, so just to be sure, you ran bash ./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2 and it didn't get stuck, right? Then could you provide your environment info so I can compare and see what causes this? Thanks

Yes, I ran bash ./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2, the same as you provided. The difference is that I used srun to start dist_train.sh because our cluster uses Slurm, but I still used the pytorch launcher.

And here is my environment:

sys.platform: linux                                                                                                                                                                                                                         
Python: 3.6.13 | packaged by conda-forge | (default, Sep 23 2021, 07:56:31) [GCC 9.4.0]                                                                                                                                                     
CUDA available: True                                                                                                                                                                                                                        
GPU 0,1: TITAN Xp                                                                                                                                                                                                                           
CUDA_HOME: /mnt/lustre/share/polaris/dep/cuda-9.0-cudnn7.6.5                                                                                                                                                                                
NVCC: Cuda compilation tools, release 9.0, V9.0.176                                                                                                                                                                                         
GCC: gcc (GCC) 5.4.0                                                                                                                                                                                                                        
PyTorch: 1.8.1+cuda90.cudnn7.6.5                                                                                                                                                                                                            
PyTorch compiling details: PyTorch built with:                                                                                                                                                                                              
  - GCC 5.4                                                                                                                                                                                                                                 
  - C++ Version: 201402                                                                                                                                                                                                                     
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications                                                                                                                          
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)                                                                                                                                                             
  - OpenMP 201307 (a.k.a. OpenMP 4.0)                                                                                                                                                                                                       
  - NNPACK is enabled                                                                                                                                                                                                                       
  - CPU capability usage: AVX2                                                                                                                                                                                                              
  - CUDA Runtime 9.0                                                                                                                                                                                                                        
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70                                                                
  - CuDNN 7.6.5                                                                                                                                                                                                                             
  - Magma 2.5.0                                                                                                                                                                                                                             
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=9.0, CUDNN_VERSION=7.6.5, CXX_COMPILER=/usr/local/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
                                                                                                                                                                                                                                            
TorchVision: 0.8.0a0+2f40a48                                                                                                                                                                                                                
OpenCV: 4.5.4                                                                                                                                                                                                                               
MMCV: 1.4.0                                                                                                                                                                                                                                 
MMCV Compiler: GCC 5.4                                                                                                                                                                                                                      
MMCV CUDA Compiler: 9.0                                                                                                                                                                                                                     
MMDetection: 2.20.0+ff9bc39                          

RangiLyu commented on Jan 29 '22

@RangiLyu thanks for the details. I'll try to match your setup in terms of library versions, since you're using different Python, PyTorch, Torchvision, CUDA, and cuDNN versions, and see if I can reproduce the issue.

vakker commented on Feb 04 '22

@RangiLyu @hhaAndroid I was able to reproduce this issue with Docker; I tried a broad range of settings:

PyTorch   CUDA   cuDNN   Works
1.6.0     10.1   7       ✔️
1.7.0     11.0   8       ❌
1.8.0     11.1   8       ❌
1.9.0     10.2   7       ❌
1.9.0     11.1   8       ❌
1.10.0    11.3   8       ❌
Dockerfile

Based on the official Dockerfile:

ARG PYTORCH="1.6.0"
ARG CUDA="10.1"
ARG CUDNN="7"


FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel

ARG MMCV="1.4.6"

ARG PYTORCH
ARG CUDA

ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"

RUN apt-get update && apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install MMCV
RUN pip install --no-cache-dir --upgrade pip wheel setuptools
RUN ["/bin/bash", "-c", "pip install mmcv-full==${MMCV} -f https://download.openmmlab.com/mmcv/dist/cu${CUDA//./}/torch${PYTORCH}/index.html"]

# Install MMDetection
RUN conda clean --all
RUN git clone https://github.com/open-mmlab/mmdetection.git /mmdetection
WORKDIR /mmdetection
ENV FORCE_CUDA="1"
RUN pip install --no-cache-dir -r requirements/build.txt
RUN pip install --no-cache-dir -e .

Build:

docker build \
    --build-arg PYTORCH=<pytorch> \
    --build-arg CUDA=<cuda> \
    --build-arg CUDNN=<cudnn> \
    -t mmdet:<pytorch>-<cuda>-<cudnn> .

After some digging, I found this discussion, so I tried using the gloo backend instead of nccl, i.e. in configs/_base_/default_runtime.py I changed dist_params = dict(backend='nccl') to dist_params = dict(backend='gloo').
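Concretely, the one-line change in configs/_base_/default_runtime.py:

# dist_params = dict(backend='nccl')   # original setting
dist_params = dict(backend='gloo')     # workaround that avoids the NCCL hang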

That does make it work on all the test scenarios:

PyTorch   CUDA   cuDNN   Works
1.6.0     10.1   7       ✔️
1.7.0     11.0   8       ✔️
1.8.0     11.1   8       ✔️
1.9.0     10.2   7       ✔️
1.9.0     11.1   8       ✔️
1.10.0    11.3   8       ✔️

Please note that the CUDA version you used (9.0) to reproduce this issue is really old; it was released in 2017, and I couldn't even find a PyTorch Docker image for it. It might make sense to test the code using versions that are more commonly used these days (10.2, 11.3). I'm not sure PyTorch even supports anything older than 10.2, see here.

Please verify whether this is an actual bug that you can reproduce.

vakker commented on Mar 07 '22

Hi @vakker @RangiLyu @ZwwWayne, I experienced the same problem. What is the recommended fix here: using torchrun, or downgrading the PyTorch version?

DangChuong-DC commented on Mar 22 '22

@DangChuong-DC I'm not sure what the right solution is; the MMLab team should reproduce this and investigate further. It seems that changing the dist backend to gloo instead of nccl works, but it might have performance implications. Maybe rewriting the dist_train.sh script to work with torchrun is the way to go, but that'll require changes in train.py too.

vakker commented on Apr 01 '22

(Quoting @vakker's Docker reproduction comment above in full: the PyTorch/CUDA/cuDNN test matrix, the Dockerfile, and the gloo-backend workaround.)

I experienced a similar problem. The outputs were as follows:

2022-04-01 12:15:36,770 - mmseg - INFO - Set random seed to 0, deterministic: False
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
2022-04-01 12:15:37,557 - mmseg - INFO - initialize ResNetV1c with init_cfg {'type': 'Pretrained', 'checkpoint': 'open-mmlab://resnet101_v1c'}
2022-04-01 12:15:37,558 - mmcv - INFO - load model from: open-mmlab://resnet101_v1c
2022-04-01 12:15:37,558 - mmcv - INFO - load checkpoint from openmmlab path: open-mmlab://resnet101_v1c
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800337 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800337 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800795 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800809 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800795 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801043 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800809 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801043 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2000 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2001 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2002 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2004 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2005 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2006 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 2003) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
tools/train.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2022-04-01_12:45:45
  host      : d7537e2e1710
  rank      : 7 (local_rank: 7)
  exitcode  : -6 (pid: 2007)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2007
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-01_12:45:45
  host      : d7537e2e1710
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 2003)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2003
=====================================================

I also changed dist_params = dict(backend='nccl') to dist_params = dict(backend='gloo').

It works for me. Thanks!

aspenstarss commented on Apr 01 '22

My solution is below. Add the following lines to ~/.bashrc:

export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

Then run source ~/.bashrc. It works for me; there is no need to change nccl to gloo.
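If editing ~/.bashrc is not convenient, the same variables can be set just for a single run in the shell that launches dist_train.sh, or (my sketch, assuming they are set before the process group is created) at the very top of the entry script:

import os

# Must run before torch.distributed creates the NCCL communicators,
# e.g. at the top of tools/train.py.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")
os.environ.setdefault("NCCL_IB_DISABLE", "1")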

jiwei0921 commented on Apr 21 '22

Thank you, but none of the above worked for me :( After days of searching and trying, my workaround was to set the system runlevel to 3:

sudo init 3

The GPUs are happy for now and I will report if anything goes wrong again.

Ref: https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x

Hanqing-Sun commented on Sep 03 '22

My solution is below. Add the following lines to ~/.bashrc:

export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

Then run source ~/.bashrc. It works for me; there is no need to change nccl to gloo.

This answer works too! Glad there's no need to change from nccl to gloo.

Thanks!

pengyu965 commented on Dec 02 '22

My solution is below. Add the following lines to ~/.bashrc:

export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

Then run source ~/.bashrc. It works for me; there is no need to change nccl to gloo.

Will this slow down training?

lifeiteng commented on Aug 25 '23

I have been completely unable to get distributed training working with the Docker container, even after following all of the suggestions in this thread. Has anybody else had better luck resolving this issue? I'm at my wits' end trying to get this to work.

h-fernand commented on Nov 20 '23

I was trying to run distributed training on a Vertex AI Workbench notebook. After countless tries, I finally got it working this way:

  1. Get a machine with PyTorch 2.2 installed (CUDA 12.1) and as many GPUs as you want (4 in my case).
  2. Do NOT create a conda env; install this in the base env:

pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"

git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -v -e .

  3. Change mmcv_maximum_version = '2.2.0' to mmcv_maximum_version = '2.2.1' in mmdetection/mmdet/__init__.py to avoid any version mismatch (see the snippet after this list).
  4. Make sure that LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu/ and NOT LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu/:/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/lib/x86_64-linux-gnu/ -- basically no duplicates -- or other variants (this happened when I got a machine with CUDA only, no PyTorch).
  5. export NCCL_P2P_DISABLE='1' and export NCCL_IB_DISABLE='1' in the base env.
  6. Run bash mmdetection/tools/dist_train.sh <CONFIG_PATH> 4.
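The edit from step 3 is a single constant in mmdet/__init__.py; only the changed line is shown here, and the surrounding version-check code stays as it is:

# mmdetection/mmdet/__init__.py
mmcv_maximum_version = '2.2.1'  # raised from '2.2.0' so the version check passes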

Thank you all for your helpful guidance!

MihaiDavid05 avatar Jul 24 '24 15:07 MihaiDavid05