Albert Zeyer
I also would like to have this. I was really confused that this is not yet supported and thought that I must have done something wrong. I really wonder why...
What PyTorch version? It's correct that `_has_data` is on CPU. But there should also be a distributed backend for it? I thought the way that we init PyTorch distributed is...
Did you set `backend`? What are your options? What happens if you set `backend` to `"cpu:gloo,cuda:nccl"` or something like that? Maybe the `init_process_group` behavior changed in PyTorch 2.6.
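Just for reference, a minimal sketch of what such a combined backend string means at the plain PyTorch level (the env-var based rendezvous via torchrun is an assumption here; RETURNN's own init code may differ):

```python
import torch
import torch.distributed as dist

# Sketch, assuming torchrun (or similar) has set RANK, WORLD_SIZE,
# MASTER_ADDR and MASTER_PORT.
# The combined backend string registers Gloo for CPU tensors and NCCL for
# CUDA tensors, so collectives on CPU tensors (e.g. a has-data flag)
# work without first copying them to the GPU.
dist.init_process_group(backend="cpu:gloo,cuda:nccl")

rank = dist.get_rank()

# CPU collective, handled by Gloo:
flag = torch.tensor([1], dtype=torch.int64)
dist.all_reduce(flag, op=dist.ReduceOp.MIN)

# CUDA collective, handled by NCCL (assuming one GPU per rank):
if torch.cuda.is_available():
    torch.cuda.set_device(rank % torch.cuda.device_count())
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
```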
@NeoLegends so you never fixed this yet?

> If I set `backend = "cpu:gloo,cuda:nccl"`

I meant within the distrib options.
Why do you close this? This is not closed, unless it is fixed.
> I tested your suggestion (`backend = "cpu:gloo,cuda:nccl"`), there is a bug in torch/distributed.py, but I'm now able to run it after fixing the bug. I will make a PR for...
> But as Moritz said, it would be better to set that as a default.

Yes, at least that. That's maybe not enough. The behavior changed here in some PyTorch...
One possible way, for example (I'm not sure whether this makes sense or is easy to do): after the torch distributed init, check whether there is a backend type associated with the device...
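A rough sketch of how such a check could look. This just probes a tiny CPU collective instead of inspecting the process group internals, so it avoids private PyTorch APIs; treat it as one possible approach, not necessarily what we would end up doing:

```python
import torch
import torch.distributed as dist


def check_cpu_backend_available() -> bool:
    """
    Sketch: after dist.init_process_group(...), probe whether collectives on
    CPU tensors actually work, i.e. whether a CPU-capable backend (e.g. Gloo)
    is registered for the default process group.
    Note: this runs a collective, so all ranks must call it together.
    """
    assert dist.is_initialized()
    try:
        probe = torch.zeros(1)  # CPU tensor
        dist.all_reduce(probe)
        return True
    except RuntimeError:
        # E.g. NCCL-only process group: no backend type associated with
        # device type cpu.
        return False
```

If such a check fails, we could either raise a clear error early, or fall back to moving the CPU-side flags (like `_has_data`) to CUDA before the collective.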
> I assume we always used default None to start both nccl and gloo.

That was exactly my question. Is this the case? Maybe ask around whether someone has used...
(The torch.distributed output is somehow messed up. Do you have a version which is not messed up?)