
RuntimeError: Distributed package doesn't have NCCL built in

pabsan-0 opened this issue on Jul 05 '22 · 5 comments

Describe the bug
The benchmarking script breaks on Jetson Xavier NX and Jetson TX2 with the error message RuntimeError: Distributed package doesn't have NCCL built in.

Reproduction
After a clean install of mmdetection following the best practices guide:

python3 -m torch.distributed.launch --nproc_per_node=1 tools/analysis_tools/benchmark.py "$CFG" "$WEIGHTS" --launcher pytorch

Environment
Docker container started with docker run --rm -ti --runtime nvidia nvcr.io/nvidia/l4t-ml:r32.6.1-py3, with everything else installed manually on top. The same error also happens outside the Docker environment. The Docker machine yielded the following env:

root@nvidia-desktop:/mmdetection# python3 mmdet/utils/collect_env.py
sys.platform: linux
Python: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
CUDA available: True
GPU 0: Xavier
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.300
GCC: aarch64-linux-gnu-gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.9.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.5
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_53,code=sm_53;-gencode;arch=compute_62,code=sm_62;-gencode;arch=compute_72,code=sm_72
  - CuDNN 8.2.1
    - Built with CuDNN 8.0
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=8.0.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -DMISSING_ARM_VST1 -DMISSING_ARM_VLD1 -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=open, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=0, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.0a0+300a8a4
OpenCV: 4.5.0
MMCV: 1.5.3
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.2
MMDetection: 2.25.0+56e42e7

Error traceback

/usr/local/lib/python3.6/dist-packages/torchvision/io/image.py:11: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
Traceback (most recent call last):
  File "tools/analysis_tools/benchmark.py", line 195, in <module>
    main()
  File "tools/analysis_tools/benchmark.py", line 187, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "/home/catec/.local/lib/python3.6/site-packages/mmcv/runner/dist_utils.py", line 41, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/catec/.local/lib/python3.6/site-packages/mmcv/runner/dist_utils.py", line 64, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 597, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in
Killing subprocess 21275
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/analysis_tools/benchmark.py', '--local_rank=0', '../jetson_nano/retinanet_swin-t-p4-w7_fpn_1x_coco_AIRPLANE/retinanet_swin-t-p4-w7_fpn_1x_coco_AIRPLANE.py', '../jetson_nano/retinanet_swin-t-p4-w7_fpn_1x_coco_AIRPLANE/epoch_100.pth', '--launcher', 'pytorch']' returned non-zero exit status 1.
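
For reference, the missing backend can also be confirmed directly from a Python shell inside the same environment. A minimal check (the values in the comments are what I would expect from the build settings listed above, not re-run output):

# Quick check of which distributed backends this PyTorch build supports.
import torch
import torch.distributed as dist

print(torch.__version__)          # 1.9.0 on the l4t-ml r32.6.1 image
print(torch.cuda.is_available())  # True: CUDA itself is fine
print(dist.is_available())        # True: the distributed package is present
print(dist.is_nccl_available())   # False: the wheel was built with USE_NCCL=0
print(dist.is_mpi_available())    # expected True, since the build settings show USE_MPI=ON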

Bug fix
Adding the line marked (+) in the following file works around the issue so that benchmarking can continue:

$ sudo vim /usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py
  
  prefix_store = PrefixStore(group_name, store)
+ backend = "gloo"
  if backend == Backend.GLOO:
      pg = ProcessGroupGloo(
          prefix_store,
          rank,
          world_size,
          timeout=timeout)
      _pg_map[pg] = (Backend.GLOO, store)
      _pg_names[pg] = group_name
  elif backend == Backend.NCCL:
      if not is_nccl_available():
          raise RuntimeError("Distributed package doesn't have NCCL "
                             "built in")

This fix was taken from here and reposted for visibility.
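
A less invasive alternative to patching the installed PyTorch, assuming the config inherits mmdetection's default_runtime.py (which sets dist_params = dict(backend='nccl')), is to override the backend in the config itself so that init_dist picks up Gloo:

# In the config passed as "$CFG" (or in configs/_base_/default_runtime.py),
# switch the distributed backend from NCCL to Gloo, since this PyTorch
# wheel ships without NCCL support. The benchmark script forwards this
# dict unchanged via init_dist(args.launcher, **cfg.dist_params).
dist_params = dict(backend='gloo')

With that change the original benchmark command should run unmodified, since _init_dist_pytorch() then ends up calling dist.init_process_group(backend='gloo', ...).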

pabsan-0 · Jul 05 '22 08:07

I have the same problem you did, bro. Did adding the extra line backend = "gloo" solve it for good, or did anything break later on?

Neil-untitled · Sep 10 '22 01:09

@Neil-untitled What do you mean by later? Adding that line let me run the benchmarking script I was trying to get working, but I didn't do much afterwards, so I can't tell whether something might break elsewhere.

pabsan-0 · Sep 12 '22 06:09

Thank you very much for replying. I tried your method and it actually worked! Now I can run benchmark.py on my Xavier NX. I am just curious whether JetPack supports NCCL at all? I also tried to build NCCL from source on my Xavier, but that didn't work.

Neil-untitled · Sep 12 '22 06:09

Glad you managed! I'm afraid I can't help you with NCCL, as I am not familiar with it.

pabsan-0 · Sep 12 '22 07:09

No worries mate. Your post saved my life, to be honest. If I hadn't seen this, I would have been stuck wondering how to solve the problem for days. Thanks again!

Neil-untitled · Sep 12 '22 08:09