RFC-0020/0021/0022 RFCs for Pipeline Parallelism
This PR consists of three RFCs:
- RFC-0020 Pipeline Parallelism Strategic Plan
- RFC-0021 Pipeline Parallelism Technical Approach Proposal
- RFC-0022 Model Partitioning in Pipeline Parallelism Proposal
Please note that the details of these proposals are subject to revision based on feedback from users and partners. Please feel free to comment on the RFCs with your feedback.
Thanks for this, @jamesr66a! It is quite detailed; it will take me a few more days to give it the full attention it deserves.
Some very quick high-level thoughts:
- I prefer figuring out a clean API to support pipeline parallelism for any model, rather than first trying to support `torch.nn.Sequential`. Things like skip connections are pretty common, so figuring out how to support these now rather than later seems like time well spent.
- I also think supporting this with as few code changes as possible to the main training loop is good, similar to `torch.distributed.DistributedDataParallel`. Of course, some things will probably need to change given that every rank is not loading input data, etc.
- I like the idea of hiding the specifics of what happens in a given batch (e.g., how many microbatches are in a batch, how these microbatches are scheduled, etc.) behind some API. IMO this achieves a clean separation of concerns: users can use free-form Python for the per-batch processing (as before), while hiding parallelization-strategy-specific implementation details behind an API. The API can also provide some way for users to override the schedule if they like (a rough sketch follows below). I wonder if it's possible to support asynchronous pipelining schemes using this approach.
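To make that last point concrete, here is a minimal sketch of what such an API could look like. The `PipelineParallel` wrapper and its `chunks`/`schedule` arguments are made-up names for illustration, not an existing PyTorch API:

```python
import torch
import torch.nn as nn

class PipelineParallel(nn.Module):
    """Hypothetical wrapper: the class name and the `chunks`/`schedule`
    arguments are illustrative only, not part of any existing PyTorch API."""

    def __init__(self, module, chunks=4, schedule="1F1B"):
        super().__init__()
        self.module = module      # model already partitioned across stages/devices
        self.chunks = chunks      # number of microbatches per batch
        self.schedule = schedule  # e.g. GPipe fill-drain, 1F1B, interleaved

    def forward(self, x):
        # Split the batch into microbatches; how they are pushed through the
        # pipeline stages is hidden behind this call.  A real implementation
        # would dispatch to the configured schedule instead of this naive loop.
        outputs = [self.module(mb) for mb in torch.chunk(x, self.chunks)]
        return torch.cat(outputs)

# The per-batch training loop then stays close to the familiar DDP-style loop:
# model = PipelineParallel(partitioned_model, chunks=8)
# for batch, target in loader:
#     loss = criterion(model(batch), target)
#     loss.backward()
#     optimizer.step(); optimizer.zero_grad()
```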
Oh, another thing that I don't see mentioned anywhere: it is common for language models to have layers that "share" parameters. For example, the embedding layer at the front of the model and the LM head often share weights. There are a couple of different ways of handling this: (a) ensure that all layers sharing parameters are on the same device and let autograd do the right thing, or (b) allow layers sharing parameters to reside on different devices, but then synchronize their gradients at the end of each step (and also initialize the shared parameters identically on every device that holds a copy).
I am sure there are other ways of handling this as well.
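As a point of reference, here is a minimal sketch of both options for the embedding/LM-head case. Option (a) is standard weight tying; for option (b), the commented `all_reduce` line (with a hypothetical `tied_weights_group` process group spanning the two stages) is the extra synchronization a pipeline runtime would have to perform once per step:

```python
import torch.nn as nn
import torch.distributed as dist

vocab_size, d_model = 1000, 64

# (a) Keep both layers on the same device and share the Parameter object;
#     autograd then sums the gradients from both uses automatically.
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight  # weight tying

# (b) If the embedding and the LM head live on different pipeline stages, each
#     stage keeps its own identically initialized copy and the gradients are
#     summed at the end of the step, e.g. (tied_weights_group is a hypothetical
#     process group containing just the stages that hold the shared weight):
# dist.all_reduce(lm_head.weight.grad, op=dist.ReduceOp.SUM, group=tied_weights_group)
```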
This is a fantastic summary of attempts for PP, great work @jamesr66a :)
Within Lightning we experimented with the initial RemoteModule from Fairscale, which relied on RPC; however, as discussed, the requirement that everything be sequential made it very restrictive.
There is another example (albeit very intrusive in the user's code) from the Graphcore team, where PP is a requirement for many models when using IPUs: https://docs.graphcore.ai/projects/poptorch-user-guide/en/1.0.0/overview.html#poptorch-block-and-poptorch-beginblock
Even though this approach is extremely intrusive, it does have the merit of supporting a wider range of models and being a bit more expressive. I would relate it to the idea behind FlexFlow's torch.fx support, since we would need to traverse the graph.
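To illustrate the flavor of that annotation-based style (without reproducing the actual poptorch API, for which the `poptorch.Block`/`BeginBlock` docs linked above are authoritative), a made-up `begin_stage` marker might look like this:

```python
import torch
import torch.nn as nn

def begin_stage(module, stage_id):
    # Made-up helper: tag a submodule with the pipeline stage it should start.
    # A partitioner would later read this attribute to split the model.
    module._pipeline_stage = stage_id
    return module

class AnnotatedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = begin_stage(nn.Linear(128, 128), stage_id=0)
        self.decoder = begin_stage(nn.Linear(128, 10), stage_id=1)

    def forward(self, x):
        # Arbitrary (non-sequential) control flow is still allowed, which is
        # what makes this style more expressive than an nn.Sequential split.
        return self.decoder(torch.relu(self.encoder(x)))
```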
> Even though this approach is extremely intrusive, it does have the merit of supporting a wider range of models and being a bit more expressive. I would relate it to the idea behind FlexFlow's torch.fx support, since we would need to traverse the graph.
Indeed, I was considering mentioning https://github.com/flexflow/flexflow as another framework to consider, but last we checked its PP support was still only planned.
The design paper is interesting, and it too uses a simulation to automatically partition the graph.
cc @pritamdamania87 @pbelevich @mrshenli @zhaojuanmao
Hi @jamesr66a!
Thank you for your pull request.
We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with "CLA signed". The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!