Add function `to_sequential` to `PipelineModule`
In https://github.com/EleutherAI/gpt-neox we were previously maintaining two separate models - one for when the user wanted to use pipeline parallelism, and one for when they didn't.
The more straightforward solution was to add a `to_sequential` function that exports the `PipelineModule` as an `nn.Sequential` model, so we can train with DeepSpeed features that aren't compatible with pipeline parallelism (e.g. ZeRO stage 2+).
Figured this might be a useful addition to the base module, too. I'm not 100% sure the support for tied layers here is as flexible as it could / should be, since their capabilities are not very well documented, but it works at least for our purposes (with tied embeddings as the output layer).
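For reference, a minimal sketch of what such an export could look like, assuming the layers live in `forward_funcs` (which is how `PipelineModule` stores them internally) and glossing over the tied-layer handling; `_FuncWrapper` is a hypothetical helper, not part of DeepSpeed:

```python
import torch.nn as nn


class _FuncWrapper(nn.Module):
    """Wrap a bare callable (e.g. a lambda between layers) so it can
    sit inside nn.Sequential. Hypothetical helper; name is illustrative."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x)


def to_sequential(self):
    """Export the pipeline's layers, in execution order, as nn.Sequential.

    Tied layers would need extra care here so that shared weights
    (e.g. tied embeddings used as the output layer) are neither
    dropped nor duplicated in the export.
    """
    return nn.Sequential(*[
        layer if isinstance(layer, nn.Module) else _FuncWrapper(layer)
        for layer in self.forward_funcs
    ])
```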
In addition to `to_sequential`, there may be another way we could accomplish this while keeping the normal `PipelineModule`, if that would be useful.
If we short-circuit this condition and use the regular training engine, I think that `PipelineModule` should behave as a normal `torch.nn.Module` and you can use ZeRO-2, etc. I intended for that to be the case, but it hasn't been tested recently.
https://github.com/microsoft/DeepSpeed/blob/dad26428e3f28898b8d0f5ace1b3df3e6db8f8e8/deepspeed/__init__.py#L119-L120
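To make the suggestion concrete, here is a rough paraphrase of that condition (not the verbatim source - the real code constructs the engines rather than returning names): `deepspeed.initialize()` picks the engine by the model's type, so "short-circuiting" means taking the regular-engine branch even for a `PipelineModule`:

```python
from deepspeed.pipe import PipelineModule


def select_engine(model):
    """Rough paraphrase of the dispatch linked above."""
    if isinstance(model, PipelineModule):
        return "PipelineEngine"   # pipeline-parallel path; ZeRO-2 unsupported
    return "DeepSpeedEngine"      # regular path; ZeRO-2 etc. available
```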
Hi @ShadenSmith, I actually tried this as well - and it seems this way of doing things drops any tied modules (since the pipe engine handles them specially). So, for example, if we used this with a model with tied embeddings, the `to_logits` function that uses the word embedding weights would just get silently dropped.
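A toy illustration of that failure mode (names are illustrative, not the gpt-neox code): the output projection reuses the input embedding's weight, so if the export bypasses the engine's special handling of tied modules, the logits computation is silently lost:

```python
import torch
import torch.nn as nn


class TiedModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.block = nn.Linear(dim, dim)

    def to_logits(self, hidden):
        # Weight tying: project back to vocabulary space with the
        # embedding matrix instead of a separate output layer.
        return hidden @ self.embed.weight.t()

    def forward(self, tokens):
        return self.to_logits(self.block(self.embed(tokens)))
```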
This is a great idea, thanks @sdtblck !
One caveat is that we lose the activation checkpointing that the `PipelineModule`'s forward can be configured to use. But users can instead use torch's `checkpoint_sequential()` if they want checkpointing. Or we could wrap the layers in a similar way as `Lambda` if we really want to mirror functionality. What are your thoughts?
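For reference, torch's helper works directly on the exported `nn.Sequential`; a minimal example (the model and shapes here are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# The Sequential is split into `segments` chunks whose intermediate
# activations are recomputed during backward instead of being stored.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(4, 64, requires_grad=True)
out = checkpoint_sequential(model, 2, x)  # functions, segments, input
out.sum().backward()
```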
Hm. Yeah this is a good point that I had overlooked. I'll spend some time looking into the best way to get this working today.
Hi @ShadenSmith
I think the two latest commits should fix both of the above requirements. There may be some repeated code between `SequentialModel` and `PipelineModule` that could be slimmed down - but I have tested with gpt-neox and it works well.
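A hypothetical end-to-end sketch of the intended usage: export the pipeline module, then initialize the regular engine with a ZeRO stage-2 config (which the pipeline engine does not support). `pipe_module` is assumed to be an already-constructed `PipelineModule`, and the kwarg for passing an inline config dict may differ between DeepSpeed versions:

```python
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 2},  # incompatible with pipe parallel
}

# Proposed API from this PR: flatten the pipeline into nn.Sequential.
model = pipe_module.to_sequential()

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```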
Can one of the admins verify this patch?
@sdtblck - just fixed some formatting issues that were preventing this - if the tests pass, would this be good to merge now?