Ke Wen
Sorry for the late reply. We have migrated the PiPPy library into [`torch.distributed.pipelining`](https://github.com/pytorch/pytorch/tree/main/torch/distributed/pipelining). Here is our new documentation: https://pytorch.org/docs/main/distributed.pipelining.html. In the "Option 2" section, you can see: > The Pipe object provides...
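For concreteness, here is a minimal sketch of that "Option 2" tracer frontend (the toy model, sizes, and split point below are illustrative, not taken from the docs):

```python
import torch
from torch.distributed.pipelining import SplitPoint, pipeline

# Toy model: 8 linear layers; submodule names are "0".."7".
mlp = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
x = torch.randn(32, 1024)  # one example micro-batch

# Trace the model and split it before layer "4" into two stages.
pipe = pipeline(mlp, mb_args=(x,), split_spec={"4": SplitPoint.BEGINNING})

# Each rank then materializes its own stage from the Pipe object, e.g.:
# stage = pipe.build_stage(stage_index, device)
```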
Hmm, do you mean getting back the full model at the end of training, but before saving the final checkpoint? It might be hard, I think, because each stage's updated...
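For what it's worth, one conceivable workaround (a sketch only; `stage_module`, `world_size`, `model`, and the file names are placeholders): save each stage's state dict from its own rank and stitch them together offline, which should work as long as the stage state dicts keep the original model's FQNs:

```python
import torch
import torch.distributed as dist

# On each pipeline rank: dump the local stage's updated weights.
torch.save(stage_module.state_dict(), f"stage_{dist.get_rank()}.pt")

# Offline, in a single process: merge the per-stage dicts. This assumes
# the stage state dicts use the original model's FQNs, so they can be
# unioned back into one full state dict.
full_sd = {}
for rank in range(world_size):
    full_sd.update(torch.load(f"stage_{rank}.pt"))
model.load_state_dict(full_sd)
```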
Thanks for making it work! Quick comment: Do you mind creating a dedicated example for DCP + PP? You can copy the model out (we plan to build a "model...
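As a starting point, a DCP + PP save/load could be as small as this sketch (`stage_module` and the checkpoint path are placeholders):

```python
import torch.distributed.checkpoint as dcp

# Each pipeline rank contributes only its own stage's tensors; DCP
# combines them into one sharded checkpoint on disk.
state_dict = {"model": stage_module.state_dict()}
dcp.save(state_dict, checkpoint_id="checkpoints/step_100")

# Loading is symmetric: each rank reads back just its stage's part,
# in place, into the provided state_dict.
dcp.load(state_dict, checkpoint_id="checkpoints/step_100")
```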
What's our plan for this PR? @LucasLLC I think we are pretty close to done. Would the following next steps be reasonable? 1. Move the example to `examples/checkpoint`, and...
For code quality checks, please run:
```
./format.sh
./check.sh
```
Documenting my discussion with @wanchaol regarding DTensor and `scaled_dot_product_attention`: @kwen2501 : Should we do to_local as soon as we did colwise, or should we do to_local when we hit...
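To make the options concrete, here is a sketch of the "to_local at SDPA" variant (the helper and placements are illustrative, not an agreed-upon design):

```python
import torch.nn.functional as F
from torch.distributed.tensor import DTensor

# q, k, v are DTensors sharded column-wise (i.e. across heads) by the
# upstream colwise-parallel projections. One option: drop to plain
# local tensors right at the attention call, then wrap the result back
# into a DTensor with the same mesh and placements.
def sharded_sdpa(q: DTensor, k: DTensor, v: DTensor) -> DTensor:
    out = F.scaled_dot_product_attention(
        q.to_local(), k.to_local(), v.to_local()
    )
    return DTensor.from_local(out, q.device_mesh, q.placements)
```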
Hi @leiwen83, that's an interesting question. I think at the ZeRO-2 stage (where the gradients are sharded), there would need to be some special arrangement: as each micro-batch runs its...
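A sketch of one such "special arrangement" (hypothetical; `stage_module` and `micro_batches` are placeholders): let the full gradients accumulate across all micro-batches, and reduce-scatter only once at the end:

```python
import torch
import torch.distributed as dist

# Run all micro-batches first; gradients accumulate unsharded in .grad.
for micro_batch in micro_batches:
    loss = stage_module(micro_batch).sum()
    loss.backward()

# Only after the last backward: reduce-scatter each gradient so every
# rank keeps just its shard, matching ZeRO-2's sharded-gradient layout.
# (Assumes each grad's dim 0 is divisible by the world size.)
world_size = dist.get_world_size()
for p in stage_module.parameters():
    shard = torch.empty_like(p.grad.chunk(world_size)[0])
    dist.reduce_scatter_tensor(shard, p.grad)
    p.grad = None  # free the full gradient
    # hand `shard` to the sharded optimizer here (not shown)
```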
Yes, please! You are very welcome to open a PR!
Is the numerical difference seen only in the backward pass, or in the forward pass too?
The original plan was for the tracer's stage and the manual stage to be in separate files (files were mapped more 1:1 with classes back then), with the base class on the manual...