
torch.distributed.DistBackendError

Open · yangershuai627 opened this issue 8 months ago · 11 comments

Search before asking

  • [x] I have searched the Ultralytics YOLO issues and discussions and found no similar questions.

Question

One of the GPUs on the server (GPU 7) is faulty. Although I have explicitly excluded it from multi-GPU training (I'm only using GPUs 5 and 6), the training still throws an error. However, when I run training on any single GPU individually, it works fine, which is quite puzzling.

My training script (my_train):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"

from ultralytics import YOLO

model = YOLO("/ultralytics-main/cfg/models/v8/yolov8n-obb-DOTAv1-MaSA.yaml").load("/ultralytics-main/weights/yolov8n-obb.pt")

result = model.train(
    data="/ultralytics-main/cfg/datasets/DOTAv1.yaml",
    epochs=100,
    batch=8,
    device=[5,6], 
    imgsz=1024,
    val=None,
    workers=4
)

Error:

DDP: debug command /disk2/xiexingxing/lwb/anaconda3/envs/yolov8/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 40813 /disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py
/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Ultralytics 8.3.13 🚀 Python-3.10.0 torch-2.6.0+cu124 CUDA:5 (NVIDIA GeForce RTX 2080 Ti, 11004MiB)
                                                      CUDA:6 (NVIDIA GeForce RTX 2080 Ti, 11004MiB)
Transferred 361/553 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLO11n...
AMP: checks passed ✅
[rank1]: Traceback (most recent call last):
[rank1]:   File "/disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py", line 13, in <module>
[rank1]:     results = trainer.train()
[rank1]:   File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 208, in train
[rank1]:     self._do_train(world_size)
[rank1]:   File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 328, in _do_train
[rank1]:     self._setup_train(world_size)
[rank1]:   File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 268, in _setup_train
[rank1]:     dist.broadcast(self.amp, src=0)  # broadcast the tensor from rank 0 to all other ranks (returns None)
[rank1]:   File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank1]:     work = group.broadcast([tensor], opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank1]: Last error:
[rank1]: nvmlDeviceGetHandleByIndex(7) failed: Unknown Error
[rank0]: Traceback (most recent call last):
[rank0]:   File "/disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py", line 13, in <module>
[rank0]:     results = trainer.train()
[rank0]:   File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 208, in train
[rank0]:     self._do_train(world_size)
[rank0]:   File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 328, in _do_train
[rank0]:     self._setup_train(world_size)
[rank0]:   File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 268, in _setup_train
[rank0]:     dist.broadcast(self.amp, src=0)  # broadcast the tensor from rank 0 to all other ranks (returns None)
[rank0]:   File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank0]:     work = group.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByIndex(7) failed: Unknown Error
W0422 21:48:46.921451 3802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3844 closing signal SIGTERM
E0422 21:48:47.036548 3802 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 3845) of binary: /disk2/xiexingxing/lwb/anaconda3/envs/yolov8/bin/python
Traceback (most recent call last):
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
    main()
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-04-22_21:48:46
  host      : gpu7
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3845)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/disk2/xiexingxing/home/yes/ultralytics-main/DOTAv1/train-MaSA.py", line 8, in <module>
    result = model.train(
  File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/model.py", line 802, in train
    self.trainer.train()
  File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 203, in train
    raise e
  File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 201, in train
    subprocess.run(cmd, check=True)
  File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '40813', '/disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py']' returned non-zero exit status 1.

Additional

No response

yangershuai627 · Apr 22 '25 14:04

👋 Hello @yangershuai627, thank you for reaching out and providing detailed logs and context 🚀! This is an automated response to help get you started; an Ultralytics engineer will also review your issue and provide further assistance soon.

We highly recommend reviewing the Docs for usage guides and troubleshooting tips. If your issue is a 🐛 Bug Report, please provide a minimum reproducible example (MRE) if you haven't already, as this helps us diagnose and resolve issues faster.

If your question relates to custom training, please share as much detail as possible, including dataset samples and training logs, and make sure you are following our Tips for Best Training Results.

Join the Ultralytics community in the way that suits you best:

  • For real-time help, visit our Discord 🎧
  • For in-depth discussions, head over to Discourse
  • Or connect with peers on our Subreddit

Upgrade

Please ensure you are using the latest ultralytics package and all requirements in a Python>=3.8 environment with PyTorch>=1.8. Sometimes, issues are resolved in more recent versions:

pip install -U ultralytics

Environments

YOLO runs reliably in any of these up-to-date verified environments (with CUDA/CUDNN, Python, and PyTorch preinstalled):

Status

If the Ultralytics CI badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLO Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Thank you for helping improve Ultralytics!

UltralyticsAssistant · Apr 22 '25 14:04

@glenn-jocher @Y-T-G Could you help me?

yangershuai627 · Apr 22 '25 14:04

Can you remove this?

os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"
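
Something like this should be enough (a minimal sketch reusing the paths from your script, so GPU selection comes only from the device argument):

from ultralytics import YOLO

model = YOLO("/ultralytics-main/cfg/models/v8/yolov8n-obb-DOTAv1-MaSA.yaml").load("/ultralytics-main/weights/yolov8n-obb.pt")

result = model.train(
    data="/ultralytics-main/cfg/datasets/DOTAv1.yaml",
    epochs=100,
    batch=8,
    device=[5, 6],  # no CUDA_VISIBLE_DEVICES override, only the device argument
    imgsz=1024,
    val=None,
    workers=4,
)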

Y-T-G · Apr 22 '25 15:04

Does nvidia-smi work?
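
If it hangs or errors, a quick Python-side check of whether the driver responds at all (just a rough sketch, not Ultralytics-specific) is:

import torch

# A healthy driver answers these immediately; a wedged NVML/driver state
# tends to warn, hang, or raise here as well.
print(torch.cuda.is_available())
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))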

Y-T-G · Apr 22 '25 15:04

nvidia-smi doesn't work.

yangershuai627 · Apr 22 '25 15:04

I removed os.environ["CUDA_VISIBLE_DEVICES"] = "5,6", but it still doesn't work.

yangershuai627 · Apr 22 '25 15:04

nvidia-smi doesn't work.

Then you need to reboot the server

Y-T-G · Apr 22 '25 15:04

Is there any other way besides rebooting the server? Other people's programs are running on it.

yangershuai627 · Apr 23 '25 01:04

The issue seems to be that NCCL is still trying to access GPU 7 (nvmlDeviceGetHandleByIndex(7) failed) even though you're only requesting GPUs 5 and 6.

Since you can't reboot the server, try setting these NCCL environment variables before your script to limit device visibility:

import os
os.environ["NCCL_VISIBLE_DEVICES"] = "5,6"  # Limit NCCL to only see GPUs 5 and 6
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # Make CUDA device IDs match nvidia-smi order

# Then your training code
from ultralytics import YOLO
model = YOLO("path/to/model")
result = model.train(
    data="path/to/data",
    epochs=100,
    batch=8,
    device=[0,1],  # Use local indices (0,1) instead of global (5,6)
    imgsz=1024
)

This should prevent NCCL from attempting to access the faulty GPU 7 during initialization.
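
If it still fails, the error message itself suggests running with NCCL_DEBUG=INFO. Setting it the same way before training (a small sketch; the variable is inherited by the DDP subprocesses) should show which device or transport step NCCL trips on:

import os

# Verbose NCCL logging, as the error message recommends
os.environ["NCCL_DEBUG"] = "INFO"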

glenn-jocher · Apr 23 '25 13:04

Thank you very much! This is very helpful.

yangershuai627 · Apr 23 '25 13:04

I'm glad the solution worked for you! This is a common issue when working with distributed training on systems with problematic GPUs.

The environment variables approach is particularly useful in shared environments where rebooting isn't an option. If you encounter similar issues in the future, you might also consider setting NCCL_IGNORE_DISABLED_P2P=1 to help with other types of inter-GPU communication problems.
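
For reference, that would look roughly like this, set before any torch/NCCL initialization (same pattern as the earlier snippet):

import os

# Per the note above: may help when peer-to-peer (P2P) communication between GPUs is disabled or unreliable
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from ultralytics import YOLO  # import and train only after the variable is set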

Happy training with your YOLO model!

glenn-jocher · Apr 23 '25 17:04

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

  • Docs: https://docs.ultralytics.com
  • HUB: https://hub.ultralytics.com
  • Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions[bot] · May 24 '25 00:05