Add function `to_sequential` to `PipelineModule`
In https://github.com/EleutherAI/gpt-neox we were previously maintaining two separate models - one for when the user wanted to use pipeline parallelism, and one for when they didn't.
The more straightforward solution was to add a `to_sequential` function that exports the `PipelineModule` as an `nn.Sequential` model, so we can train with DeepSpeed features that aren't compatible with pipeline parallelism (e.g. ZeRO stage 2+).
Figured this might be a useful addition to the base module, too. I'm not 100% sure the support for tied layers here is as flexible as it could / should be, since their capabilities are not very well documented, but it works at least for our purposes (with tied embeddings as the output layer).
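For reference, a minimal sketch of what such an export could look like, assuming the layers live in `forward_funcs` (which is how `PipelineModule` stores them internally) and glossing over the tied-layer handling; `_FuncWrapper` is a hypothetical helper, not part of DeepSpeed:

```python
import torch.nn as nn


class _FuncWrapper(nn.Module):
    """Wrap a bare callable (e.g. a lambda between layers) so it can
    sit inside nn.Sequential. Hypothetical helper; name is illustrative."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x)


def to_sequential(self):
    """Export the pipeline's layers, in execution order, as nn.Sequential.

    Tied layers would need extra care here so that shared weights
    (e.g. tied embeddings used as the output layer) are neither
    dropped nor duplicated in the export.
    """
    return nn.Sequential(*[
        layer if isinstance(layer, nn.Module) else _FuncWrapper(layer)
        for layer in self.forward_funcs
    ])
```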
In addition to `to_sequential`, there may be another way we could accomplish this while keeping the normal `PipelineModule`, if that would be useful.
If we short-circuit this condition and use the regular training engine, I think that `PipelineModule` should behave as a normal `torch.nn.Module` and you can use ZeRO-2, etc. I intended for that to be the case, but it hasn't been tested recently.
https://github.com/microsoft/DeepSpeed/blob/dad26428e3f28898b8d0f5ace1b3df3e6db8f8e8/deepspeed/__init__.py#L119-L120
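To make the suggestion concrete, here is a rough paraphrase of that condition (not the verbatim source - the real code constructs the engines rather than returning names): `deepspeed.initialize()` picks the engine by the model's type, so "short-circuiting" means taking the regular-engine branch even for a `PipelineModule`:

```python
from deepspeed.pipe import PipelineModule


def select_engine(model):
    """Rough paraphrase of the dispatch linked above."""
    if isinstance(model, PipelineModule):
        return "PipelineEngine"   # pipeline-parallel path; ZeRO-2 unsupported
    return "DeepSpeedEngine"      # regular path; ZeRO-2 etc. available
```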
Hi @ShadenSmith, I actually tried this as well - and it seems this way of doing things drops any tied modules (since the pipe engine handles them specially). So, for example, if we used this with a model with tied embeddings, the `to_logits` function that uses the word embedding weights would just get silently dropped.
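A toy illustration of that failure mode (names are illustrative, not the gpt-neox code): the output projection reuses the input embedding's weight, so if the export bypasses the engine's special handling of tied modules, the logits computation is silently lost:

```python
import torch
import torch.nn as nn


class TiedModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.block = nn.Linear(dim, dim)

    def to_logits(self, hidden):
        # Weight tying: project back to vocabulary space with the
        # embedding matrix instead of a separate output layer.
        return hidden @ self.embed.weight.t()

    def forward(self, tokens):
        return self.to_logits(self.block(self.embed(tokens)))
```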
This is a great idea, thanks @sdtblck !
One caveat is that we lose the activation checkpointing that the `PipelineModule`'s forward can be configured to use. But users can instead use torch's `checkpoint_sequential()` if they want checkpointing. Or we could wrap the layers in a similar way as `Lambda` if we really want to mirror functionality. What are your thoughts?
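For reference, torch's helper works directly on the exported `nn.Sequential`; a minimal example (the model and shapes here are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# The Sequential is split into `segments` chunks whose intermediate
# activations are recomputed during backward instead of being stored.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(4, 64, requires_grad=True)
out = checkpoint_sequential(model, 2, x)  # functions, segments, input
out.sum().backward()
```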
Hm. Yeah this is a good point that I had overlooked. I'll spend some time looking into the best way to get this working today.
Hi @ShadenSmith
I think the two latest commits should fix both of the above requirements. There may be some repeated code between `SequentialModel` and `PipelineModule` that could be slimmed down - but I have tested with gpt-neox and it works well.
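A hypothetical end-to-end sketch of the intended usage: export the pipeline module, then initialize the regular engine with a ZeRO stage-2 config (which the pipeline engine does not support). `pipe_module` is assumed to be an already-constructed `PipelineModule`, and the kwarg for passing an inline config dict may differ between DeepSpeed versions:

```python
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 2},  # incompatible with pipe parallel
}

# Proposed API from this PR: flatten the pipeline into nn.Sequential.
model = pipe_module.to_sequential()

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```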
Can one of the admins verify this patch?
@sdtblck - just fixed some formatting issues that were preventing this - if the tests pass, would this be good to merge now?