
Add function `to_sequential` to PipelineModule

Open sdtblck opened this issue 4 years ago • 5 comments

In https://github.com/EleutherAI/gpt-neox we were previously maintaining two separate models - one for when the user wanted to use pipeline parallelism, and one for when they didn't.

The more straightforward solution was to add a `to_sequential` function to export the PipelineModule as an nn.Sequential model, so that we can train with DeepSpeed features that aren't compatible with pipeline parallelism (i.e. ZeRO stage 2 and above).

I figure this might be a useful addition to the base module, too. I'm not 100% sure whether the support for tied layers here is as flexible as it could or should be, since their capabilities are not very well documented, but it works at least for our purposes (with tied embeddings as the output layer).
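For context, here is a minimal sketch of what such a `to_sequential` conversion could look like, assuming the PipelineModule keeps its layer definitions in a list of LayerSpec / TiedLayerSpec entries. The attribute name `_layer_specs` and the handling of tied layers are illustrative, not the exact DeepSpeed internals or the code in this PR:

```python
import torch.nn as nn

def to_sequential(self):
    """Export the pipeline layers as a plain nn.Sequential model (sketch)."""
    layers = []
    tied = {}  # tie key -> first instantiated module, so tied layers share weights
    for spec in self._layer_specs:           # hypothetical attribute name
        if isinstance(spec, nn.Module):
            layers.append(spec)              # already-built module
        elif hasattr(spec, "key"):           # a TiedLayerSpec-like entry
            if spec.key not in tied:
                tied[spec.key] = spec.build()
            layers.append(tied[spec.key])    # reuse the same module instance
        else:                                # a LayerSpec-like entry
            layers.append(spec.build())
    return nn.Sequential(*layers)
```

The key design point is that tied entries must map back to a single shared module instance; building each spec independently would duplicate (and un-tie) the weights.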

sdtblck avatar Apr 28 '21 21:04 sdtblck

In addition to to_sequential, there may be another way we could accomplish this while keeping the normal PipelineModule, if that would be useful.

If we short-circuit this condition and use the regular training engine, I think PipelineModule should behave as a normal torch.nn.Module and you can use ZeRO-2, etc. I intended for that to be the case, but it hasn't been tested recently.

https://github.com/microsoft/DeepSpeed/blob/dad26428e3f28898b8d0f5ace1b3df3e6db8f8e8/deepspeed/__init__.py#L119-L120
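For reference, the check being short-circuited is essentially the dispatch inside deepspeed.initialize that routes a PipelineModule to the pipeline engine rather than the regular training engine. A paraphrased sketch (constructor arguments abbreviated, not the exact source):

```python
# Paraphrased sketch of the engine dispatch in deepspeed.initialize:
if isinstance(model, PipelineModule):
    engine = PipelineEngine(model=model)     # pipeline-parallel engine
else:
    engine = DeepSpeedEngine(model=model)    # regular engine; ZeRO-2 etc. apply here
```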

ShadenSmith avatar Apr 30 '21 03:04 ShadenSmith

Hi @ShadenSmith, I actually tried this as well, and it seems this approach drops any tied modules (since the pipe engine handles them specially). So, for example, if we used this with a model with tied embeddings, the to_logits function that reuses the word embedding weights would just be silently dropped.
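To illustrate why this matters: a tied-embedding model reuses the input embedding weight as the output projection, so if that tied module only exists through the pipeline engine's special handling, bypassing the engine silently loses the final projection to logits. A minimal sketch of the pattern (illustrative only, not gpt-neox code):

```python
import torch.nn as nn

class TiedOutput(nn.Module):
    """Output projection that shares its weights with the input embedding."""
    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.embedding = embedding  # shared module, not a copy

    def forward(self, hidden):
        # "to_logits": project hidden states onto the vocabulary using
        # the (tied) word embedding matrix.
        return hidden @ self.embedding.weight.t()
```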

sdtblck avatar Apr 30 '21 12:04 sdtblck

This is a great idea, thanks @sdtblck !

One caveat is that we lose the activation checkpointing that the PipelineModule's forward can be configured to use. But users can instead use torch's checkpoint_sequential() if they want checkpointing, or we could wrap the layers in a similar way to Lambda if we really want to mirror the functionality. What are your thoughts?
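For example, once the model has been exported to an nn.Sequential, activation checkpointing can be recovered with torch's built-in helper. A minimal usage sketch, where the input shape and segment count are placeholders and `to_sequential` is the export discussed in this PR:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

seq_model = pipe_module.to_sequential()      # hypothetical export from the PR
x = torch.randn(8, 1024, 768, requires_grad=True)

# Split the Sequential into 4 segments; activations are only stored at segment
# boundaries and recomputed inside each segment during the backward pass.
out = checkpoint_sequential(seq_model, 4, x)
```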

Hm, yeah, this is a good point that I had overlooked. I'll spend some time today looking into the best way to get this working.

sdtblck avatar Apr 30 '21 12:04 sdtblck

Hi @ShadenSmith

I think the two latest commits should fix both of the above requirements. There is perhaps some repeated code between SequentialModel and PipelineModule that could be slimmed down, but I have tested it with gpt-neox and it works well.
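For readers following along, here is a rough sketch of what a SequentialModel wrapper with built-in activation checkpointing could look like. The class name matches the comment above, but the attributes and the checkpoint-interval scheme are illustrative, not the exact code from these commits:

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SequentialModel(nn.Module):
    """nn.Sequential-like container that checkpoints chunks of layers (sketch)."""
    def __init__(self, layers, checkpoint_interval=0):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.checkpoint_interval = checkpoint_interval  # 0 disables checkpointing

    def forward(self, x):
        if self.checkpoint_interval <= 0:
            for layer in self.layers:
                x = layer(x)
            return x
        # Run the layers in chunks; each chunk's activations are recomputed
        # during the backward pass instead of being stored.
        for start in range(0, len(self.layers), self.checkpoint_interval):
            chunk = list(self.layers[start:start + self.checkpoint_interval])

            def run_chunk(inp, chunk=chunk):
                for layer in chunk:
                    inp = layer(inp)
                return inp

            x = checkpoint(run_chunk, x)
        return x
```

This mirrors how the PipelineModule's forward can checkpoint every N layers, while still presenting a plain nn.Module that the regular training engine (and ZeRO-2) can consume.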

sdtblck avatar Apr 30 '21 14:04 sdtblck

Can one of the admins verify this patch?

rocm-mici avatar Jun 09 '22 20:06 rocm-mici

@sdtblck - just fixed some formatting issues that were holding this PR up - if the tests pass, would this be good to merge now?

loadams avatar Nov 14 '23 23:11 loadams