torch.distributed.DistBackendError
Search before asking
- [x] I have searched the Ultralytics YOLO issues and discussions and found no similar questions.
Question
One of the GPUs on the server (GPU 7) is faulty. Although I have explicitly excluded it from multi-GPU training (I'm only using GPUs 5 and 6), the training still throws an error. However, when I run training on any single GPU individually, it works fine, which is quite puzzling.
my_train:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"

from ultralytics import YOLO

model = YOLO("/ultralytics-main/cfg/models/v8/yolov8n-obb-DOTAv1-MaSA.yaml").load("/ultralytics-main/weights/yolov8n-obb.pt")
result = model.train(
    data="/ultralytics-main/cfg/datasets/DOTAv1.yaml",
    epochs=100,
    batch=8,
    device=[5, 6],
    imgsz=1024,
    val=None,
    workers=4,
)
Error:
DDP: debug command /disk2/xiexingxing/lwb/anaconda3/envs/yolov8/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 40813 /disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py
/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
Ultralytics 8.3.13 🚀 Python-3.10.0 torch-2.6.0+cu124 CUDA:5 (NVIDIA GeForce RTX 2080 Ti, 11004MiB)
CUDA:6 (NVIDIA GeForce RTX 2080 Ti, 11004MiB)
Transferred 361/553 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLO11n...
AMP: checks passed ✅
[rank1]: Traceback (most recent call last):
[rank1]: File "/disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py", line 13, in <module>
[rank1]: results = trainer.train()
[rank1]: File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 208, in train
[rank1]: self._do_train(world_size)
[rank1]: File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 328, in _do_train
[rank1]: self._setup_train(world_size)
[rank1]: File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 268, in _setup_train
[rank1]: dist.broadcast(self.amp, src=0) # broadcast the tensor from rank 0 to all other ranks (returns None)
[rank1]: File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank1]: work = group.broadcast([tensor], opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: nvmlDeviceGetHandleByIndex(7) failed: Unknown Error
[rank0]: Traceback (most recent call last):
[rank0]: File "/disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py", line 13, in <module>
[rank0]: results = trainer.train()
[rank0]: File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 208, in train
[rank0]: self._do_train(world_size)
[rank0]: File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 328, in _do_train
[rank0]: self._setup_train(world_size)
[rank0]: File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 268, in _setup_train
[rank0]: dist.broadcast(self.amp, src=0) # broadcast the tensor from rank 0 to all other ranks (returns None)
[rank0]: File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank0]: work = group.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByIndex(7) failed: Unknown Error
W0422 21:48:46.921451 3802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3844 closing signal SIGTERM
E0422 21:48:47.036548 3802 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 3845) of binary: /disk2/xiexingxing/lwb/anaconda3/envs/yolov8/bin/python
Traceback (most recent call last):
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-22_21:48:46
host : gpu7
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3845)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
File "/disk2/xiexingxing/home/yes/ultralytics-main/DOTAv1/train-MaSA.py", line 8, in <module>
result = model.train(
File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/model.py", line 802, in train
self.trainer.train()
File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 203, in train
raise e
File "/disk2/xiexingxing/home/yes/ultralytics-main/ultralytics/engine/trainer.py", line 201, in train
subprocess.run(cmd, check=True)
File "/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/disk2/xiexingxing/lwb/anaconda3/envs/yolov8/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '40813', '/disk2/xiexingxing/home/.config/Ultralytics/DDP/_temp_5q8cz_ok139902431240800.py']' returned non-zero exit status 1.
Additional
No response
👋 Hello @yangershuai627, thank you for reaching out and providing detailed logs and context 🚀! This is an automated response to help get you started; an Ultralytics engineer will also review your issue and provide further assistance soon.
We highly recommend reviewing the Docs for usage guides and troubleshooting tips. If your issue is a 🐛 Bug Report, please provide a minimum reproducible example (MRE) if you haven't already, as this helps us diagnose and resolve issues faster.
If your question relates to custom training, please share as much detail as possible, including dataset samples and training logs, and make sure you are following our Tips for Best Training Results.
Join the Ultralytics community in the way that suits you best:
- For real-time help, visit our Discord 🎧
- For in-depth discussions, head over to Discourse
- Or connect with peers on our Subreddit
Upgrade
Please ensure you are using the latest ultralytics package and all requirements in a Python>=3.8 environment with PyTorch>=1.8. Sometimes, issues are resolved in more recent versions:
pip install -U ultralytics
Environments
YOLO runs reliably in any of these up-to-date verified environments (with CUDA/CUDNN, Python, and PyTorch preinstalled):
- Notebooks with free GPU
- Google Cloud Deep Learning VM. See GCP Quickstart Guide
- Amazon Deep Learning AMI. See AWS Quickstart Guide
- Docker Image. See Docker Quickstart Guide
Status
If the Ultralytics CI badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLO Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
Thank you for helping improve Ultralytics!
@glenn-jocher @Y-T-G Could you help me?
Can you remove this?
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"
Does nvidia-smi work?
nvidia-smi doesn't work.
I removed os.environ["CUDA_VISIBLE_DEVICES"] = "5,6", but it still doesn't work.
Then you need to reboot the server
Is there any other way besides rebooting the server? Other people's programs are running on it.
The issue seems to be that NCCL is still trying to access GPU 7 (nvmlDeviceGetHandleByIndex(7) failed) even though you're only requesting GPUs 5 and 6.
Since you can't reboot the server, try setting these NCCL environment variables before your script to limit device visibility:
import os

os.environ["NCCL_VISIBLE_DEVICES"] = "5,6"      # Limit NCCL to only see GPUs 5 and 6
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # Make CUDA device IDs match nvidia-smi order

# Then your training code
from ultralytics import YOLO

model = YOLO("path/to/model")
result = model.train(
    data="path/to/data",
    epochs=100,
    batch=8,
    device=[0, 1],  # Use local indices (0,1) instead of global (5,6)
    imgsz=1024,
)
This should prevent NCCL from attempting to access the faulty GPU 7 during initialization.
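As a side note for later readers (an addition, not part of the original exchange): if you instead hide the faulty card with CUDA_VISIBLE_DEVICES, remember that the surviving GPUs are re-indexed from 0 inside the process, so device= must use those local indices rather than the physical ones. A minimal sketch, reusing GPUs 5 and 6 from the script above, to confirm what the process actually sees:

import os

# Restrict CUDA visibility before torch initializes its CUDA context
# (same physical GPUs 5 and 6 as in the original script).
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"

import torch

# The two remaining GPUs are now exposed as local indices 0 and 1, so a later
# model.train(device=[0, 1], ...) refers to physical GPUs 5 and 6.
print("Visible device count:", torch.cuda.device_count())  # expected: 2
for i in range(torch.cuda.device_count()):
    print(f"  local index {i}: {torch.cuda.get_device_name(i)}")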
Thank you very much! This is useful for me.
I'm glad the solution worked for you! This is a common issue when working with distributed training on systems with problematic GPUs.
The environment variables approach is particularly useful in shared environments where rebooting isn't an option. If you encounter similar issues in the future, you might also consider setting NCCL_IGNORE_DISABLED_P2P=1 to help with other types of inter-GPU communication problems.
Happy training with your YOLO model!
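For completeness, here is a minimal sketch (my own combination, not verified on the reporter's machine) of setting the two extra NCCL variables mentioned in this thread, NCCL_IGNORE_DISABLED_P2P from the comment above and NCCL_DEBUG=INFO from the original error message, before launching training; the weights and dataset paths are placeholders:

import os

# Set these before importing torch/ultralytics so the spawned DDP workers inherit them.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"  # tolerate disabled peer-to-peer links
os.environ["NCCL_DEBUG"] = "INFO"             # verbose NCCL logs, as the error message recommends

from ultralytics import YOLO

model = YOLO("yolov8n-obb.pt")  # placeholder weights
model.train(
    data="DOTAv1.yaml",         # placeholder dataset config
    epochs=100,
    batch=8,
    device=[0, 1],              # local indices, assuming device visibility is restricted as above
    imgsz=1024,
)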
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
- Docs: https://docs.ultralytics.com
- HUB: https://hub.ultralytics.com
- Community: https://community.ultralytics.com
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐