DeepSpeed
[BUG] - Multiple 5090s failing on deepspeed.initialize()
Describe the bug
The developer of the training code Diffusion-pipe helped me debug this; the issue on that repository has all the relevant information I have so far. His summary:
So plain PyTorch GPU communication ops work. But deepspeed.initialize() is always failing when it does its version of cross-GPU communication. Myself and other users have this working, but it fails specifically with multiple 5090s, and you are probably the only person who has tried that setup.
I would raise an issue with Deepspeed. I don't think I've done anything wrong in the application code, and it is likely an internal Deepspeed problem. Without being able to reproduce the error myself, there's not much more I can do.
Full issue: https://github.com/tdrussell/diffusion-pipe/issues/235#issuecomment-2831270369
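For context, "plain PyTorch GPU communication ops" here means basic torch.distributed collectives. A minimal two-GPU check along those lines (a sketch, not the script used in the linked issue; the file name is just an example) can be launched with torchrun --nproc_per_node=2 check_comm.py:
# check_comm.py -- minimal cross-GPU collective check (sketch)
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # If this all_reduce completes on both ranks, basic NCCL communication works.
    t = torch.ones(4, device=f"cuda:{local_rank}") * (dist.get_rank() + 1)
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()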
To Reproduce
Steps to reproduce the behavior:
Run deepspeed.initialize() with 2 x 5090 GPUs
System info (please complete the following information):
- OS: ubuntu 24.04
- GPU count and types: 1 machine with 2 x 5090s
- Python version: 3.12
- Any other relevant info about your setup: All latest Nvidia drivers, pytorch nightly etc.
I have the same issue, just with a 5090 + 4090. I can run training on each card individually, but using both together throws illegal memory access errors.
Hi @Oruli - we don't have any 5090s so we cannot test this, but I do not see this on a machine with 2 A6000s. Could you perhaps share your DeepSpeed and CUDA versions here, as well as the simple script you are running?
I would love to help test this so we can get it fixed for me and other users.
I am not on that box right now, but I believe I am using the latest of everything: PyTorch nightly, CUDA 12.8; as for DeepSpeed, I'm not specifying a version, so I assume that's the latest as well.
It's installed via https://github.com/tdrussell/diffusion-pipe/blob/main/requirements.txt
I am not using my own script; I am running Diffusion Pipe. I posted the developer's reply in my OP.
Let me know if you have a script I can run to test and provide logs for etc.
@loadams Still no reply even after my offer to debug and get this fixed? I'm not the only person with 2 x 5090s.
@Oruli - I re-read the thread. Are you still seeing this with the latest DeepSpeed version? Just so we can narrow our search to the current commits.
And can you share a minimal repro for the issue, since I don't see one on the linked issue? That way we can look at the code, see which launcher you're using, and check whether deepspeed.init_distributed() is required here.
Yes, it's the same, and I already answered the question about code: I don't write code and I'm not using my own code.
I am using the diffusion-pipe repo that I've linked, and I even provided the developer's response saying it relates to your code. I'm not sure what else you want me to provide at this point.
@Oruli - can you please run a deepspeed example that uses multiple GPUs to see if it is the DeepSpeed integration in the other library or related to DeepSpeed and 5090s specifically?
Any training script should do; let's test with the DeepSpeed launcher:
deepspeed --num_gpus=2 your_script.py --deepspeed_config ds_config.json
Dummy ds_config:
{
  "train_batch_size": 4,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 0
  }
}
Dummy script:
# test_deepspeed_two_gpus.py
import torch
import deepspeed
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser = deepspeed.add_config_arguments(parser)
    parser.add_argument('--deepspeed_config', type=str, default='ds_config.json')
    return parser.parse_args()

def main():
    args = parse_args()

    # Dummy model for testing
    model = torch.nn.Linear(10, 10)

    # Initialize DeepSpeed
    model_engine, _, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters()
    )

    # Print which GPU the model is on
    print(f"Model is on device: {model_engine.device}")

    # Dummy input
    input_data = torch.randn(4, 10).to(model_engine.device)
    output = model_engine(input_data)
    print(f"Output shape: {output.shape}")

if __name__ == "__main__":
    main()
@loadams Thanks for the help.
So to be clear, my setup uses Conda; I'm installing as follows to be able to run your script:
conda create -n deepspeed python=3.12
conda activate diffusion-pipe
conda install pip
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
conda install nvidia/label/cuda-12.8.1::cuda-nvcc
pip install --upgrade pip wheel setuptools
pip install deepspeed
Result (currently getting an error running the command you gave):
deepspeed --num_gpus=2 your_script.py --deepspeed_config ds_config.json
argparse.ArgumentError: argument --deepspeed_config: conflicting option string: --deepspeed_config
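This particular error most likely comes from the test script rather than the GPUs: deepspeed.add_config_arguments() already registers --deepspeed_config on the parser, so adding it again raises the conflict. A minimal illustration of the pattern (the corrected script further down simply drops the duplicate argument):
import argparse
import deepspeed

parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)  # this already adds --deepspeed_config
# parser.add_argument('--deepspeed_config', ...)  # re-adding it triggers the ArgumentError above
args = parser.parse_args()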
@Oruli, I noticed in your OP that the failure occurs during a send/recv operation. Can you also try the p2p tests in the communication benchmark suite? https://github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/communication
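A bare-bones standalone send/recv check between the two cards (a sketch, not taken from that benchmark suite; the file name is just an example, launched with torchrun --nproc_per_node=2 p2p_check.py or via the deepspeed launcher) would be roughly:
# p2p_check.py -- minimal point-to-point send/recv sketch
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

t = torch.full((1024,), float(rank), device=f"cuda:{local_rank}")
if rank == 0:
    dist.send(t, dst=1)   # rank 0 pushes its tensor to rank 1
else:
    dist.recv(t, src=0)   # rank 1 overwrites its tensor with rank 0's zeros
    print("recv ok:", torch.all(t == 0).item())

dist.barrier()
dist.destroy_process_group()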
I'd like to chime in - I attempted to use 2x 5090s with diffusion-pipe as well, on a vast.ai instance, and it fails with the same error. 1x 5090 on the same machine works without issue. 2x 5070 Tis worked flawlessly on a different instance.
@loadams this is still an issue. I would really like to use the other 5090 I paid for; any chance we can get it resolved?
@Oruli, I noticed in your OP that the failure occurs during a send/recv operation. Can you also try the p2p tests in the communication benchmark suite? https://github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/communication
As I'm trying to test this in the exact setup I'm using it under (conda), I cannot get past the error I pasted above; it may be an issue with the script I was provided.
The following test script works for me and successfully runs with 2 GPUs:
import torch
import deepspeed
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser = deepspeed.add_config_arguments(parser)
    parser.add_argument('--local_rank', type=int, default=-1, help='local rank passed from distributed launcher')
    return parser.parse_args()

def main():
    args = parse_args()

    # Dummy model for testing
    model = torch.nn.Linear(10, 10)

    # Initialize DeepSpeed
    model_engine, _, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters()
    )

    # Print which GPU the model is on
    print(f"Model is on device: {model_engine.device}")

    # Dummy input
    input_data = torch.randn(4, 10).to(model_engine.device)
    output = model_engine(input_data)
    print(f"Output shape: {output.shape}")

if __name__ == "__main__":
    main()
Running using:
deepspeed --num_gpus=2 test.py --deepspeed_config ds_config.json
@Oruli - are you able to use the script that @tdrussell listed above?
Can you also ensure you are on the latest DeepSpeed version?
@loadams @tdrussell thank you for the help. Here is the output:
deepspeed --num_gpus=2 test.py --deepspeed_config ds_config.json
[2025-07-17 09:15:53,939] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-07-17 09:15:56,497] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-07-17 09:15:56,497] [INFO] [runner.py:605:main] cmd = /home/r/miniconda3/envs/diffusion-pipe/bin/python3.12 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test.py --deepspeed_config ds_config.json
[2025-07-17 09:15:57,352] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-07-17 09:15:59,178] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-07-17 09:15:59,178] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-07-17 09:15:59,178] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-07-17 09:15:59,178] [INFO] [launch.py:164:main] dist_world_size=2
[2025-07-17 09:15:59,178] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-07-17 09:15:59,179] [INFO] [launch.py:256:main] process 7499 spawned with command: ['/home/r/miniconda3/envs/diffusion-pipe/bin/python3.12', '-u', 'test.py', '--local_rank=0', '--deepspeed_config', 'ds_config.json']
[2025-07-17 09:15:59,180] [INFO] [launch.py:256:main] process 7500 spawned with command: ['/home/r/miniconda3/envs/diffusion-pipe/bin/python3.12', '-u', 'test.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json']
[2025-07-17 09:16:00,042] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-07-17 09:16:00,101] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-07-17 09:16:01,946] [INFO] [logging.py:107:log_dist] [Rank -1] DeepSpeed info: version=0.17.0, git-hash=unknown, git-branch=unknown
[2025-07-17 09:16:01,946] [INFO] [comm.py:675:init_distributed] cdb=None
[2025-07-17 09:16:01,980] [INFO] [logging.py:107:log_dist] [Rank -1] DeepSpeed info: version=0.17.0, git-hash=unknown, git-branch=unknown
[2025-07-17 09:16:01,980] [INFO] [comm.py:675:init_distributed] cdb=None
[2025-07-17 09:16:01,980] [INFO] [comm.py:706:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/r/ai/train/diffusion-pipe/test.py", line 33, in <module>
[rank1]: main()
[rank1]: File "/home/r/ai/train/diffusion-pipe/test.py", line 18, in main
[rank1]: model_engine, _, _, _ = deepspeed.initialize(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/deepspeed/__init__.py", line 179, in initialize
[rank1]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/deepspeed/runtime/config.py", line 718, in __init__
[rank1]: config_decoded = base64.urlsafe_b64decode(config).decode('utf-8')
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/base64.py", line 134, in urlsafe_b64decode
[rank1]: return b64decode(s)
[rank1]: ^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/base64.py", line 88, in b64decode
[rank1]: return binascii.a2b_base64(s, strict_mode=validate)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: binascii.Error: Invalid base64-encoded string: number of data characters (13) cannot be 1 more than a multiple of 4
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/r/ai/train/diffusion-pipe/test.py", line 33, in <module>
[rank0]: main()
[rank0]: File "/home/r/ai/train/diffusion-pipe/test.py", line 18, in main
[rank0]: model_engine, _, _, _ = deepspeed.initialize(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/deepspeed/__init__.py", line 179, in initialize
[rank0]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/deepspeed/runtime/config.py", line 718, in __init__
[rank0]: config_decoded = base64.urlsafe_b64decode(config).decode('utf-8')
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/base64.py", line 134, in urlsafe_b64decode
[rank0]: return b64decode(s)
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/base64.py", line 88, in b64decode
[rank0]: return binascii.a2b_base64(s, strict_mode=validate)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: binascii.Error: Invalid base64-encoded string: number of data characters (13) cannot be 1 more than a multiple of 4
[rank0]:[W717 09:16:03.903621983 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-07-17 09:16:05,181] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 7499
[2025-07-17 09:16:05,194] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 7500
[2025-07-17 09:16:05,194] [ERROR] [launch.py:325:sigkill_handler] ['/home/r/miniconda3/envs/diffusion-pipe/bin/python3.12', '-u', 'test.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json'] exits with return code = 1
@Oruli - do you have a deepspeed config in that folder? I think the default of not having one is what is causing this error.
@tdrussell I'm running your script inside the root of diffusion pipe as that's how/where I'm getting the original issue. Could you provide a ds_config.json that I should be using to replicate this?
@Oruli - you should be able to use any ds_config.json, you'll just need one. This page goes into detail on them:
https://www.deepspeed.ai/docs/config-json/
This is a fairly reduced ds_config:
{
  "train_batch_size": 1,
  "train_micro_batch_size_per_gpu": 1,
  "fp16": {
    "enabled": true
  }
}
@loadams @tdrussell
Error below, different from the one I posted above, as I've now included your ds_config.json:
[2025-07-28 11:04:04,813] [INFO] [config.py:944:print_user_config] json = {
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "fp16": {
        "enabled": true
    }
}
Model is on device: cuda:0
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/r/ai/train/diffusion-pipe/test.py", line 33, in <module>
[rank1]: main()
[rank1]: File "/home/r/ai/train/diffusion-pipe/test.py", line 29, in main
[rank1]: output = model_engine(input_data)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2105, in forward
[rank1]: loss = self.module(*inputs, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank1]: return inner()
[rank1]: ^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1805, in inner
[rank1]: result = forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/linear.py", line 125, in forward
[rank1]: return F.linear(input, self.weight, self.bias)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/r/ai/train/diffusion-pipe/test.py", line 33, in <module>
[rank0]: main()
[rank0]: File "/home/r/ai/train/diffusion-pipe/test.py", line 29, in main
[rank0]: output = model_engine(input_data)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2105, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]: return inner()
[rank0]: ^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1805, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/r/miniconda3/envs/diffusion-pipe/lib/python3.12/site-packages/torch/nn/modules/linear.py", line 125, in forward
[rank0]: return F.linear(input, self.weight, self.bias)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
[rank0]:[W728 11:04:05.965386272 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-07-28 11:04:06,702] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 41986
[2025-07-28 11:04:06,702] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 41987
[2025-07-28 11:04:06,804] [ERROR] [launch.py:325:sigkill_handler] ['/home/r/miniconda3/envs/diffusion-pipe/bin/python3.12', '-u', 'test.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json'] exits with return code = 1
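This particular failure is a dtype mismatch in the smoke test rather than the original multi-GPU crash: with "fp16": {"enabled": true}, DeepSpeed casts the model weights to half precision while the dummy input stays float32. One way to make the smoke test pass (a sketch against the test.py shown above) is to cast the input to the engine's parameter dtype, or alternatively to set "enabled": false in ds_config.json:
# Inside main(), after deepspeed.initialize(...):
input_data = torch.randn(4, 10).to(model_engine.device)
# With fp16 enabled the engine's weights are half precision, so match the input dtype.
input_data = input_data.to(next(model_engine.module.parameters()).dtype)
output = model_engine(input_data)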
With @tdrussell's updated script from here: https://github.com/tdrussell/diffusion-pipe/issues/235
I get this output:
[2025-07-28 11:08:30,568] [INFO] [config.py:944:print_user_config] json = {
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "fp16": {
        "enabled": true
    }
}
[2025-07-28 11:08:30,568] [INFO] [engine.py:105:__init__] CONFIG: micro_batches=4 micro_batch_size=1
[2025-07-28 11:08:30,568] [INFO] [engine.py:146:__init__] is_pipe_partitioned= False is_grad_partitioned= False
[2025-07-28 11:08:30,585] [INFO] [engine.py:165:__init__] RANK=0 STAGE=0 LAYERS=1 [0, 1) STAGE_PARAMS=110 (0.000M) TOTAL_PARAMS=110 (0.000M) UNIQUE_PARAMS=110 (0.000M)
Model is on device: cuda:0
Model is on device: cuda:1
[rank0]:[W728 11:08:31.959431767 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W728 11:08:31.292704504 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-07-28 11:08:33,812] [INFO] [launch.py:351:main] Process 43830 exits successfully.
[2025-07-28 11:08:33,812] [INFO] [launch.py:351:main] Process 43829 exits successfully.
@loadams I've provided all the info you asked for, what else do you need?
The discussion spilled over to the original thread, but the developer cannot test any further: https://github.com/tdrussell/diffusion-pipe/issues/235
@Oruli - this recent one looks to be working (agree with this comment). So DeepSpeed is able to load the model onto the GPU just fine.
I'm not able to reproduce this on A6000s. But could you try adding things back to the script to get to the smallest script that reproduces the failure?
@loadams Yes, I can do whatever is needed, but I am not a developer, and beyond running the diffusion-pipe training script I'm completely lost here. But I can test if you give me the code, and I'm definitely not the only one with this issue; just yesterday someone else replied to that thread:
Colnb83 left a comment (tdrussell/diffusion-pipe#235)
I'm having the same issue and error with 2x RTX 6000, must be a cuda/blackwell issue?
Really would love to get this fixed!!
I am getting the exact same issue training with diffusion-pipe on 5090 cards as soon as I try a multi-GPU setup.
- A single 5090 works fine.
- Multiple 5090s don't work.
- Multiple A100s work fine.
- Multiple A6000s work fine.
Exact same error as OP:
ProcessGroupNCCL.cpp:1981] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
You must stop trying to reproduce this on different cards, as you absolutely need 5090 cards to get the issue. I've seen people also reporting the same for RTX PRO 6000 cards, but this one I haven't tried myself.
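For anyone still hitting the illegal memory access, a commonly suggested way to narrow it down (a diagnostic sketch, not a confirmed fix; test.py and ds_config.json are the smoke-test files from earlier in this thread) is to rerun with verbose NCCL logging and synchronous kernel launches, and separately with NCCL peer-to-peer disabled:
# Verbose NCCL logs plus synchronous launches for a more accurate stack trace
NCCL_DEBUG=INFO CUDA_LAUNCH_BLOCKING=1 deepspeed --num_gpus=2 test.py --deepspeed_config ds_config.json

# Separate experiment: force traffic through host memory instead of direct GPU-to-GPU P2P.
# If the crash disappears, the P2P path between the two cards is the likely culprit.
NCCL_P2P_DISABLE=1 deepspeed --num_gpus=2 test.py --deepspeed_config ds_config.json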
Have we made any progress on this since last August? A dual 5090 setup costs only 1.78/hour on RunPod. If you need one up for a couple of hours to investigate, I can provide one with the full setup to reproduce.
@ChuckNovice It's been fixed for me for a while now. I thought it was an update here (assuming you've updated everything), but maybe it was another package.
Make sure you are running the latest drivers and CUDA 13 (I am).
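For anyone comparing working and failing setups, a quick way to report the relevant versions from inside the training environment (a small sketch; the file name is just an example) is:
# report_versions.py -- print the pieces that tend to differ between setups
import torch
import deepspeed

print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)
print("NCCL:", torch.cuda.nccl.version())
print("deepspeed:", deepspeed.__version__)
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))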