Ke Wen
Closing this as a stale PR. `_broadcast_oop` has been implemented as a function internal to `ProcessGroupNCCL` in PR #83713, rather than exposed as a c10d user API.
Maybe also worth writing a bit in the PR description about the context -- you need out-of-place reduce from the backend because you want to compose a reduce_scatter_v pattern at...
It would also be good to add the above as a comment in the code.
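For readers following the thread, here is a minimal sketch of the composition idea (illustrative only, not the PR's implementation): it emulates an out-of-place reduce with the public `dist.reduce` plus a copy, and `split_sizes` is a hypothetical argument describing the uneven shards.

```python
# Hedged sketch of a reduce_scatter_v-like pattern built from per-destination
# reduce calls. Assumes an initialized process group; split sizes are illustrative.
import torch
import torch.distributed as dist

def reduce_scatter_v_sketch(input_tensor, split_sizes, group=None):
    rank = dist.get_rank(group)
    chunks = torch.split(input_tensor, split_sizes)  # uneven shards along dim 0
    result = None
    for dst, chunk in enumerate(chunks):
        # Clone to emulate out-of-place behavior: dist.reduce writes the
        # reduced result into its tensor argument on the destination rank.
        buf = chunk.clone()
        dist.reduce(buf, dst=dst, op=dist.ReduceOp.SUM, group=group)
        if dst == rank:
            result = buf
    return result  # this rank's variable-sized shard of the reduced tensor
```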
@pytorchbot merge
Looks like a general PyTorch issue: running a simple DDP program:
You can rebase your PR onto `viable/strict` and see if the failures go away.
Hi, thanks for reporting it. That's a known issue. The CPU communication backend (Gloo) does not have good support for `batch_isend_irecv`, which we recently moved to for communicating multiple...
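For context, a minimal usage sketch of `batch_isend_irecv` with the NCCL backend (buffer sizes, the ring pattern, and the one-GPU-per-rank device mapping are illustrative assumptions):

```python
# Assumes an already-initialized NCCL process group with >= 2 ranks.
import torch
import torch.distributed as dist

def ring_exchange():
    rank = dist.get_rank()
    world = dist.get_world_size()
    send_buf = torch.full((4,), float(rank), device=f"cuda:{rank}")
    recv_buf = torch.empty(4, device=f"cuda:{rank}")
    ops = [
        dist.P2POp(dist.isend, send_buf, (rank + 1) % world),  # send to next rank
        dist.P2POp(dist.irecv, recv_buf, (rank - 1) % world),  # receive from previous rank
    ]
    # batch_isend_irecv launches all point-to-point ops together and
    # returns one request per op; wait on each before using recv_buf.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_buf
```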
Hi, on latency: if you measure the first iteration, it will include the distributed initialization time (e.g. NCCL communicator initialization). You can try giving it some warm-up runs and then...
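A sketch of the measurement pattern I mean (`step_fn` is a placeholder for your training/inference step; the CUDA synchronization assumes a GPU run):

```python
# Illustrative timing helper: warm-up iterations exclude one-time costs
# (e.g. NCCL communicator init) from the measured per-iteration latency.
import time
import torch

def timed_run(step_fn, warmup=3, iters=10):
    for _ in range(warmup):
        step_fn()
    torch.cuda.synchronize()  # make sure warm-up work has finished
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()  # wait for measured work to finish
    return (time.perf_counter() - start) / iters
```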
On `examples/cpu_init/gpt2_cpu_init.py`, I couldn't repro the error with either 2 ranks or 4 ranks. Are you using the Llama model with CPU init?
On memory consumption, it is expected to be high if you initialize the model on a real device. We are actively developing techniques to support creating the initial model on the meta device:...
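A minimal sketch of what meta-device initialization looks like in recent PyTorch versions (the later step of materializing real weights per rank is out of scope here):

```python
# Illustrative only: constructing a module on the meta device allocates no
# real storage, so per-rank memory stays low until weights are materialized.
import torch
import torch.nn as nn

with torch.device("meta"):
    model = nn.Linear(4096, 4096)

print(model.weight.device)  # "meta" -- no parameter memory allocated yet
```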