Ke Wen
Closing this as a stale PR. `_broadcast_oop` has been implemented as a function internal to `ProcessGroupNCCL` in PR #83713, rather than exposed as a c10d user API.
Maybe also worth writing a bit in the PR description about the context -- you need out-of-place reduce from the backend because you want to compose a reduce_scatter_v pattern at...
It would also be good to add the above as a comment in the code.
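For readers following the thread, here is a minimal sketch of the composition idea (illustrative only, not the PR's implementation): it emulates an out-of-place reduce with the public `dist.reduce` plus a copy, and `split_sizes` is a hypothetical argument describing the uneven shards.

```python
# Hedged sketch of a reduce_scatter_v-like pattern built from per-destination
# reduce calls. Assumes an initialized process group; split sizes are illustrative.
import torch
import torch.distributed as dist

def reduce_scatter_v_sketch(input_tensor, split_sizes, group=None):
    rank = dist.get_rank(group)
    chunks = torch.split(input_tensor, split_sizes)  # uneven shards along dim 0
    result = None
    for dst, chunk in enumerate(chunks):
        # Clone to emulate out-of-place behavior: dist.reduce writes the
        # reduced result into its tensor argument on the destination rank.
        buf = chunk.clone()
        dist.reduce(buf, dst=dst, op=dist.ReduceOp.SUM, group=group)
        if dst == rank:
            result = buf
    return result  # this rank's variable-sized shard of the reduced tensor
```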
@pytorchbot merge
Looks like a general PyTorch issue: running a simple DDP program:
You can rebase your PR onto `viable/strict` and see if the failures go away.
Hi, thanks for reporting it. That's a known issue. The CPU communication backend (Gloo) does not have good support for `batch_isend_irecv`, which we recently moved to for communicating multiple...
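For context, a minimal usage sketch of `batch_isend_irecv` with the NCCL backend (buffer sizes, the ring pattern, and the one-GPU-per-rank device mapping are illustrative assumptions):

```python
# Assumes an already-initialized NCCL process group with >= 2 ranks.
import torch
import torch.distributed as dist

def ring_exchange():
    rank = dist.get_rank()
    world = dist.get_world_size()
    send_buf = torch.full((4,), float(rank), device=f"cuda:{rank}")
    recv_buf = torch.empty(4, device=f"cuda:{rank}")
    ops = [
        dist.P2POp(dist.isend, send_buf, (rank + 1) % world),  # send to next rank
        dist.P2POp(dist.irecv, recv_buf, (rank - 1) % world),  # receive from previous rank
    ]
    # batch_isend_irecv launches all point-to-point ops together and
    # returns one request per op; wait on each before using recv_buf.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_buf
```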
Hi, on latency: if you measure the first iteration, it will include the distributed initialization time (e.g. NCCL communicator initialization). You can try giving it some warm-up runs and then...
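A sketch of the measurement pattern I mean (`step_fn` is a placeholder for your training/inference step; the CUDA synchronization assumes a GPU run):

```python
# Illustrative timing helper: warm-up iterations exclude one-time costs
# (e.g. NCCL communicator init) from the measured per-iteration latency.
import time
import torch

def timed_run(step_fn, warmup=3, iters=10):
    for _ in range(warmup):
        step_fn()
    torch.cuda.synchronize()  # make sure warm-up work has finished
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()  # wait for measured work to finish
    return (time.perf_counter() - start) / iters
```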
On `examples/cpu_init/gpt2_cpu_init.py`, I couldn't repro the error with either 2 ranks or 4 ranks. Are you using the Llama model with CPU init?
On memory consumption, it is expected to be high if you initialize the model on a real device. We are actively developing techniques to support creating the initial model on the meta device:...
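A minimal sketch of what meta-device initialization looks like in recent PyTorch versions (the later step of materializing real weights per rank is out of scope here):

```python
# Illustrative only: constructing a module on the meta device allocates no
# real storage, so per-rank memory stays low until weights are materialized.
import torch
import torch.nn as nn

with torch.device("meta"):
    model = nn.Linear(4096, 4096)

print(model.weight.device)  # "meta" -- no parameter memory allocated yet
```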