PT: No backend type associated with device type cpu
Some of my trainings had this crash right after startup. I just want to report this for now; I don't have many details yet on why this happens. The trainings are scheduled quite normally on an A5000 GPU node, and nvidia-smi on that node gives sensible results.
EXCEPTION
Traceback (most recent call last):
File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/rnn.py", line 11, in <module>
line: main()
locals:
main = <local> <function main at 0x7fe2f2b4cee0>
File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/__main__.py", line 744, in main
line: execute_main_task()
locals:
execute_main_task = <global> <function execute_main_task at 0x7fe2f2b4cdc0>
File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/__main__.py", line 543, in execute_main_task
line: engine.train()
locals:
engine = <global> <returnn.torch.engine.Engine object at 0x7fe2271e7520>
engine.train = <global> <bound method Engine.train of <returnn.torch.engine.Engine object at 0x7fe2271e7520>>
File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/torch/engine.py", line 256, in Engine.train
line: self.train_epoch()
locals:
self = <local> <returnn.torch.engine.Engine object at 0x7fe2271e7520>
self.train_epoch = <local> <bound method Engine.train_epoch of <returnn.torch.engine.Engine object at 0x7fe2271e7520>>
File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/torch/engine.py", line 387, in Engine.train_epoch
line: torch.distributed.all_reduce(_has_data, op=torch.distributed.ReduceOp.MIN)
locals:
torch = <global> <module 'torch' from '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'>
torch.distributed = <global> <module 'torch.distributed' from '/usr/local/lib/python3.10/dist-packages/torch/distributed/__init__.py'>
torch.distributed.all_reduce = <global> <function all_reduce at 0x7fe226f68550>
_has_data = <local> tensor[1] i8 [1]
op = <not found>
torch.distributed.ReduceOp = <global> <class 'torch.distributed.distributed_c10d.ReduceOp'>
torch.distributed.ReduceOp.MIN = <global> <RedOpType.MIN: 3>
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in init_process_group
line: return func(*args, **kwargs)
locals:
func = <local> <function all_reduce at 0x7fe226f684c0>
args = <local> (tensor[1] i8 [1],)
kwargs = <local> {'op': <RedOpType.MIN: 3>}
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2806, in all_reduce
line: work = group.allreduce([tensor], opts)
locals:
work = <not found>
group = <local> <torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe1e44b66f0>
group.allreduce = <local> <bound method PyCapsule.allreduce of <torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe1e44b66f0>>
tensor = <local> tensor[1] i8 [1]
opts = <local> <torch.distributed.distributed_c10d.AllreduceOptions object at 0x7fe1faa492b0>
RuntimeError: No backend type associated with device type cpu
What PyTorch version?
It's correct that _has_data is on CPU. But there should also be a distributed backend for it? I thought the way we init PyTorch distributed covers both CPU and CUDA? See this comment:
# when no backend is specified, both gloo and nccl backends will be created
# the gloo backend will be used for collectives with CPU tensors and
# the nccl backend will be used for collectives with CUDA tensors
dist.init_process_group(backend=self._opts.get("backend", None))
Did you maybe set backend to something specific? Why? What happens if you leave it at None? What happens if you set it to "cpu:gloo,cuda:nccl" or something like that?
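For reference, this is roughly what the device-to-backend mapping means at the plain torch.distributed level (just a standalone sketch, not RETURNN code; it assumes the usual env vars are set by the launcher, e.g. torchrun):

import torch
import torch.distributed as dist

# explicit device-to-backend mapping: gloo handles CPU tensors,
# NCCL handles CUDA tensors
dist.init_process_group(backend="cpu:gloo,cuda:nccl")

# a CPU tensor like _has_data should then go through gloo instead of failing
has_data = torch.tensor([1], dtype=torch.int8)
dist.all_reduce(has_data, op=dist.ReduceOp.MIN)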
Maybe related:
- https://discuss.pytorch.org/t/mpi-backend-and-gpu-tensor-error/185745
- https://github.com/Lightning-AI/pytorch-lightning/issues/18803
- https://github.com/Lightning-AI/torchmetrics/issues/2477
- https://discuss.huggingface.co/t/bug-on-multi-gpu-trainer-with-accelerate/141592
- https://github.com/zyushun/Adam-mini/issues/28
- (And much more when you search for the error...)
I switched back to torch 2.5 from 2.6 and it has not happened again so far. I will keep an eye on this.
I think torch 2.6 somehow causes this. I can make this issue go away if I switch back to torch 2.5 on the same GPU node.
Did you set backend? What are your options? What happens if you set backend to "cpu:gloo,cuda:nccl" or something like that?
Maybe the init_process_group behavior changed in PyTorch 2.6.
Hi, I'm also having the exact same issue (I actually came here from https://github.com/Lightning-AI/pytorch-lightning/issues/18803 and found this thread there... GitHub is very small 😄):
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in init_process_group
line: return func(*args, **kwargs)
locals:
func = <local> <function torch.distributed.distributed_c10d.all_reduce>
args = <local> (tensor[1] i8 [1],)
kwargs = <local> {'op': <RedOpType.MIN: 3>}
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2806, in all_reduce
line: work = group.allreduce([tensor], opts)
locals:
group = <local> <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f2703afec70>
group.allreduce = <local> <bound method PyCapsule.allreduce of <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f2703afec70>>
tensor = <local> tensor[1] i8 [1]
opts = <local> <torch.distributed.distributed_c10d.AllreduceOptions object at 0x7f2865c8d130>
RuntimeError: No backend type associated with device type cpu
In my training config I set backend = "torch". If I set backend = "cpu:gloo,cuda:nccl" I get a KeyError in select_engine. I don't think this is really the backend option that you want us to check, is it?
@NeoLegends so you never fixed this yet?
If I set backend = "cpu:gloo,cuda:nccl"

I meant within the torch_distributed options, not the global RETURNN backend option (which selects the engine).
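I.e. roughly like this in the config (just a sketch; the other torch_distributed entries are whatever you already use):

backend = "torch"  # selects the RETURNN PyTorch engine, leave this as-is

# the distributed process-group backend goes in here instead
torch_distributed = {"backend": "cpu:gloo,cuda:nccl"}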
@albertz I tested your suggestion (backend = "cpu:gloo,cuda:nccl"). There is a bug in torch/distributed.py, but I'm now able to run it after fixing the bug. I will make a PR for RETURNN.
Why do you close this? This should not be closed unless it is fixed.
I tested your suggestion (backend = "cpu:gloo,cuda:nccl"). There is a bug in torch/distributed.py, but I'm now able to run it after fixing the bug. I will make a PR for RETURNN.
I'm not sure I understand. backend = "cpu:gloo,cuda:nccl" works or not? Fixing what bug exactly?
@Stefanwuu Does the error go away for you if you set this parameter via your RETURNN config? If so, please file another PR that sets backend = "cpu:gloo,cuda:nccl" as the default (so that distributed training works at all without having to set it to that value via the config). Only then is this issue fixed.
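For concreteness, such a default change would presumably touch only the init quoted above, something like this (a sketch, not the actual PR):

# hypothetical default: fall back to an explicit device-to-backend mapping instead of None
dist.init_process_group(backend=self._opts.get("backend", "cpu:gloo,cuda:nccl"))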
Sorry for the confusion. I closed this because it can indeed be fixed by setting e.g. torch_distributed = {"param_sync_step": 100, "reduce_type": "param", "backend": "cpu:gloo,cuda:nccl"} in the RETURNN config. But as Moritz said, it would be better to set that as a default.
But as Moritz said, it would be better to set that as a default.
Yes, at least that. But maybe that's not enough. The behavior changed here in some PyTorch version (which one exactly? 2.6?), and I think we want old configs to still work with a new PyTorch, without a need to change the config (because changing the config would change the hash, etc.). So if a user just had "backend": "nccl" before, does this break now? Or maybe that's not relevant because no one has ever set backend explicitly so far?
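If backward compatibility for a bare "nccl" turns out to matter, one idea (untested, just a sketch) would be to remap it at init time:

backend = self._opts.get("backend", None)
if backend == "nccl":
    # hypothetical compat shim: a bare "nccl" leaves CPU tensors without
    # a backend, so map it to the device-qualified form
    backend = "cpu:gloo,cuda:nccl"
dist.init_process_group(backend=backend)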
One way, for example (I'm not sure if this makes sense or is easy to do): after the torch distributed init, check whether there is a backend type associated with device type cpu, and if not, set one up using gloo (but only for cpu, and keep the other installed backends).
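A rough sketch of that idea using only public APIs (untested; I don't know a public way to attach gloo to the already-created default group, so this creates a separate gloo group that callers would then use for CPU collectives):

import torch
import torch.distributed as dist

def _cpu_capable_group():
    # hypothetical helper: return a process group usable for CPU collectives
    probe = torch.zeros(1, dtype=torch.int8)
    try:
        # if the default group already has a CPU backend (e.g. gloo), this just works
        dist.all_reduce(probe, op=dist.ReduceOp.MIN)
        return dist.group.WORLD
    except RuntimeError:
        # no backend for CPU tensors (e.g. NCCL-only init): create a separate
        # gloo group; the default NCCL group stays in place for CUDA tensors
        return dist.new_group(backend="gloo")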
I think setting a new default would be sufficient: there was that bug with explicitly setting backend, but nobody fixed it, so I assume we always used default None to start both nccl and gloo. Would you agree with simply changing the default for backend? @albertz
I assume we always used default None to start both nccl and gloo.
That was exactly my question. Is this the case? Maybe ask around whether someone has used backend. But maybe you are right and no-one has set backend explicitly so far.
Would you agree with simply changing the default for backend?
Well, that in any case; there is no question about that. You can do a PR for this.
I'm just wondering if we should do more to it. But maybe that's enough for now.