
PT: No backend type associated with device type cpu


Some of my trainings had this crash right after startup. I just want to report this for now; I don't have many details on why this happens yet. The trainings are scheduled normally on an A5000 GPU node, and nvidia-smi on that node gives sensible results.

EXCEPTION
Traceback (most recent call last):
  File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/rnn.py", line 11, in <module>
    line: main()
    locals:
      main = <local> <function main at 0x7fe2f2b4cee0>
  File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/__main__.py", line 744, in main
    line: execute_main_task()
    locals:
      execute_main_task = <global> <function execute_main_task at 0x7fe2f2b4cdc0>
  File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/__main__.py", line 543, in execute_main_task
    line: engine.train()
    locals:
      engine = <global> <returnn.torch.engine.Engine object at 0x7fe2271e7520>
      engine.train = <global> <bound method Engine.train of <returnn.torch.engine.Engine object at 0x7fe2271e7520>>
  File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/torch/engine.py", line 256, in Engine.train
    line: self.train_epoch()
    locals:
      self = <local> <returnn.torch.engine.Engine object at 0x7fe2271e7520>
      self.train_epoch = <local> <bound method Engine.train_epoch of <returnn.torch.engine.Engine object at 0x7fe2271e7520>>
  File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/torch/engine.py", line 387, in Engine.train_epoch
    line: torch.distributed.all_reduce(_has_data, op=torch.distributed.ReduceOp.MIN)
    locals:
      torch = <global> <module 'torch' from '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'>
      torch.distributed = <global> <module 'torch.distributed' from '/usr/local/lib/python3.10/dist-packages/torch/distributed/__init__.py'>
      torch.distributed.all_reduce = <global> <function all_reduce at 0x7fe226f68550>
      _has_data = <local> tensor[1] i8 [1]
      op = <not found>
      torch.distributed.ReduceOp = <global> <class 'torch.distributed.distributed_c10d.ReduceOp'>
      torch.distributed.ReduceOp.MIN = <global> <RedOpType.MIN: 3>
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in init_process_group
    line: return func(*args, **kwargs)
    locals:
      func = <local> <function all_reduce at 0x7fe226f684c0>
      args = <local> (tensor[1] i8 [1],)
      kwargs = <local> {'op': <RedOpType.MIN: 3>}
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2806, in all_reduce
    line: work = group.allreduce([tensor], opts)
    locals:
      work = <not found>
      group = <local> <torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe1e44b66f0>
      group.allreduce = <local> <bound method PyCapsule.allreduce of <torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe1e44b66f0>>
      tensor = <local> tensor[1] i8 [1]
      opts = <local> <torch.distributed.distributed_c10d.AllreduceOptions object at 0x7fe1faa492b0>
RuntimeError: No backend type associated with device type cpu

NeoLegends (May 14 '25 14:05)

What PyTorch version?

It's correct that _has_data is on CPU, but there should also be a distributed backend for it, right? I thought the way we init PyTorch distributed covers both CPU and CUDA? See this comment:

# when no backend is specified, both gloo and nccl backends will be created
# the gloo backend will be used for collectives with CPU tensors and
# the nccl backend will be used for collectives with CUDA tensors
dist.init_process_group(backend=self._opts.get("backend", None))

Did you maybe set backend to something specific? Why? What happens if you leave it as None? What happens if you set it to "cpu:gloo,cuda:nccl" or something like that?

Maybe related:

  • https://discuss.pytorch.org/t/mpi-backend-and-gpu-tensor-error/185745
  • https://github.com/Lightning-AI/pytorch-lightning/issues/18803
  • https://github.com/Lightning-AI/torchmetrics/issues/2477
  • https://discuss.huggingface.co/t/bug-on-multi-gpu-trainer-with-accelerate/141592
  • https://github.com/zyushun/Adam-mini/issues/28
  • (And much more when you search for the error...)

albertz (May 14 '25 20:05)
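
To illustrate the device-to-backend mapping discussed above, here is a minimal standalone sketch (not RETURNN code; it assumes the usual torchrun environment variables such as RANK, WORLD_SIZE and MASTER_ADDR):

import torch
import torch.distributed as dist

# backend=None lets PyTorch create both a gloo (CPU) and an nccl (CUDA)
# process group; "cpu:gloo,cuda:nccl" requests the same mapping explicitly.
dist.init_process_group(backend="cpu:gloo,cuda:nccl")

_has_data = torch.tensor([1], dtype=torch.int8)  # CPU tensor, as in the traceback
dist.all_reduce(_has_data, op=dist.ReduceOp.MIN)  # served by the gloo backend

dist.destroy_process_group()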

I switched back to torch 2.5 from 2.6 and it has not happened again so far. I will keep looking out for this.

NeoLegends (May 16 '25 13:05)

I think torch 2.6 somehow causes this. I can make this issue go away if I switch back to torch 2.5 on the same GPU node.

NeoLegends (May 20 '25 09:05)

Did you set backend? What are your options? What happens if you set backend to "cpu:gloo,cuda:nccl" or something like that?

Maybe the init_process_group behavior changed in PyTorch 2.6.

albertz (May 20 '25 10:05)

Hi, I'm also having the exact same issue (I actually came here from https://github.com/Lightning-AI/pytorch-lightning/issues/18803 and found this thread there... GitHub is very small 😄):

  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in init_process_group
    line: return func(*args, **kwargs)
    locals:
      func = <local> <function torch.distributed.distributed_c10d.all_reduce>
      args = <local> (tensor[1] i8 [1],)
      kwargs = <local> {'op': <RedOpType.MIN: 3>}
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2806, in all_reduce
    line: work = group.allreduce([tensor], opts)
    locals:
      group = <local> <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f2703afec70>
      group.allreduce = <local> <bound method PyCapsule.allreduce of <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f2703afec70>>
      tensor = <local> tensor[1] i8 [1]
      opts = <local> <torch.distributed.distributed_c10d.AllreduceOptions object at 0x7f2865c8d130>
RuntimeError: No backend type associated with device type cpu

In my training config I set backend = "torch". If I set backend = "cpu:gloo,cuda:nccl" instead, I get a KeyError in select_engine. I don't think this is really the backend option you want us to check, is it?

Icemole (Jul 10 '25 13:07)

@NeoLegends so you haven't fixed this yet?

If I set backend = "cpu:gloo,cuda:nccl"

I meant within the distributed options (torch_distributed), not the top-level backend option.

albertz (Jul 10 '25 15:07)
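
To make that distinction concrete, here is a hedged sketch of the two places where "backend" shows up in a RETURNN config as discussed in this thread (option names taken from the comments here; the exact layout of your config may differ):

backend = "torch"  # top-level option: selects the engine; this is where the KeyError in select_engine came from

torch_distributed = {
    # this "backend" is what gets passed to torch.distributed.init_process_group()
    "backend": "cpu:gloo,cuda:nccl",
}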

@albertz I tested your suggestion (backend = "cpu:gloo,cuda:nccl"). There is a bug in torch/distributed.py, but I'm now able to run it after fixing that bug. I will make a PR for RETURNN.

Stefanwuu (Jul 18 '25 09:07)

Why did you close this? This should not be closed unless it is fixed.

albertz (Jul 18 '25 09:07)

I tested your suggestion (backend = "cpu:gloo,cuda:nccl"). There is a bug in torch/distributed.py, but I'm now able to run it after fixing that bug. I will make a PR for RETURNN.

I'm not sure I understand. Does backend = "cpu:gloo,cuda:nccl" work or not? Which bug exactly did you fix?

albertz (Jul 18 '25 09:07)

@Stefanwuu Does the error go away for you if you set this parameter via your RETURNN config? If so, please file another PR that sets backend = "cpu:gloo,cuda:nccl" as the default (so that you don't have to set it via the config to make distributed training work at all). Only then is this issue fixed.

NeoLegends (Jul 18 '25 09:07)

Sorry for the confusion. I closed this because it can indeed be fixed by setting e.g. torch_distributed = {"param_sync_step": 100, "reduce_type": "param", "backend": "cpu:gloo,cuda:nccl"} in the RETURNN config. But as Moritz said, it would be better to set that as the default.

Stefanwuu (Jul 18 '25 09:07)

But as Moritz said, it would be better to set that as the default.

Yes, at least that. But maybe that's not enough. The behavior changed in some PyTorch version (which one exactly? 2.6?), and I think we want old configs to keep working with new PyTorch without having to change the config (because changing the config would change the hash, etc.). So if a user just had "backend": "nccl" before, does this break now? Or maybe that's not relevant because no one has ever set backend explicitly so far?

albertz (Jul 18 '25 10:07)
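
One hypothetical way to keep such old configs working (sketch only; the helper name is made up and not part of RETURNN) would be to normalize a bare backend name into an explicit device:backend mapping before passing it to init_process_group:

def normalize_backend(backend):
    # None: PyTorch itself creates gloo (CPU) + nccl (CUDA), so keep it.
    if backend is None:
        return None
    # old-style bare "nccl": keep nccl for CUDA, add gloo so CPU tensors still work
    if backend == "nccl":
        return "cpu:gloo,cuda:nccl"
    # anything else (e.g. already a device:backend mapping) is passed through
    return backend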

One option, though I'm not sure whether this makes sense or is easy to do: after the torch distributed init, check whether there is a backend associated with device type cpu, and if not, set one up using gloo (but only for CPU, and keep the other installed backends).

albertz (Jul 18 '25 10:07)
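
A very rough sketch of that idea (hypothetical helper, not RETURNN code; note that such a probe is itself a collective, so it must run at the same point on all ranks, otherwise it can hang):

import torch
import torch.distributed as dist

def cpu_collectives_work() -> bool:
    probe = torch.zeros(1, dtype=torch.int8)  # CPU tensor
    try:
        dist.all_reduce(probe, op=dist.ReduceOp.SUM)
        return True
    except RuntimeError:
        # e.g. "No backend type associated with device type cpu"
        return False

# If they don't work, one could create a separate gloo group just for CPU
# collectives and keep the existing backends for CUDA tensors:
#   gloo_group = dist.new_group(backend="gloo")
#   dist.all_reduce(cpu_tensor, op=dist.ReduceOp.MIN, group=gloo_group)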

I think setting a new default would be sufficient. Because of that bug with explicitly setting backend, which nobody fixed, I assume everyone always used the default None, which starts both nccl and gloo. Would you agree with simply changing the default for backend? @albertz

Stefanwuu (Jul 18 '25 10:07)

I assume everyone always used the default None, which starts both nccl and gloo.

That was exactly my question. Is this the case? Maybe ask around whether anyone has used backend. But maybe you are right and no one has set backend explicitly so far.

Would you agree with simply changing the default for backend?

Well, that in any case; there is no question about that. You can do a PR for this.

I'm just wondering whether we should do more than that. But maybe that's enough for now.

albertz (Jul 18 '25 10:07)