Megatron-DeepSpeed

[deepspeed pipe] expand the partitioning method to support weights

Open stas00 opened this issue 2 years ago • 2 comments

We will need to hack https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/module.py#L378-L384 to support a partition_method like type:embed:2|transformer:1 - or something like that - so that the embedding weights get a 2x partitioning weight, the embedding gets its own stage, and all stages become more balanced.

For context please see: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/166#issuecomment-963818130

It's actually not complicated at all. It's just a simple weighting scheme.

Let's look at the partitioning weights produced by the code I quoted in the first paragraph:

With 4 layers and 4 GPUs:

  1. type:transformer [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0] gets partitioned as [0, 0, 0, 1], [1], [1], [1, 0, 0, 0, 0]
  2. type:embed|transformer [0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0] gets partitioned as [0, 1, 0, 1], [1], [1], [1, 0, 0, 1, 0] (or something similar - I haven't validated),

but what we want is this:

the initial weights should be: [0, 2, 0, 1, 1, 1, 1, 0, 0, 2, 0], which should now get partitioned as [0, 2], [0, 1, 1], [1, 1], [0, 0, 2, 0]

(note: I'm not exactly sure where the 0s belong; it should be easy to see with print debugging or a debugger)
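For illustration, here is a quick sketch comparing the two weight vectors with the same partition_balanced helper that the linked type:... branch calls (just a check, not validated against the actual layer layout):

```python
# Illustration only (assumes deepspeed is installed). partition_balanced
# is the helper PipelineModule._partition_layers uses for the type:...
# method; it returns stage boundaries, i.e. stage i gets layers
# parts[i]:parts[i+1].
from deepspeed.runtime import utils as ds_utils

num_stages = 4

# current type:transformer weights: 1 per transformer layer, 0 elsewhere
binary_weights = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

# proposed weighted variant: embeddings get 2, transformer layers get 1
weighted_weights = [0, 2, 0, 1, 1, 1, 1, 0, 0, 2, 0]

for name, weights in (("binary", binary_weights), ("weighted", weighted_weights)):
    parts = ds_utils.partition_balanced(weights=weights, num_parts=num_stages)
    print(name, parts)
```

Printing the boundaries for both should make it obvious whether the embeddings end up on their own stages.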

For context: the 250k vocab for mt5 has a huge embedding - it's 2x bigger than a single transformer layer (in 104B). That's why we want them partitioned so that an embedding gets its own stage and then every 2 layers use another stage.

This is for the case of 60 layers plus 2 embeddings and 32 pipe stages (60*1 + 2*2 = 64 weight units / 32 stages = 2 units per stage).

and once we are happy we can contribute this to deepspeed.

p.s. we need to think about the best syntax to use, probably weighted_type:embed:2|transformer:1
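Something along these lines could turn that syntax into per-layer weights (a rough sketch only; weighted_type and the helper name below are made up, and the real implementation would live inside _partition_layers and reuse _find_layer_type, which also matches layer class names with a case-insensitive regex):

```python
import re

from deepspeed.runtime import utils as ds_utils


def weighted_type_partition(layer_class_names, method, num_stages):
    """Rough sketch, not a DeepSpeed API: parse a spec like
    'weighted_type:embed:2|transformer:1' into {pattern: weight}, build a
    per-layer weight vector by matching layer class names, and hand it to
    partition_balanced."""
    spec = method[len('weighted_type:'):]
    type_weights = {}
    for entry in spec.split('|'):
        layertype, weight = entry.split(':')
        type_weights[layertype] = int(weight)

    layer_weights = [0] * len(layer_class_names)
    for idx, name in enumerate(layer_class_names):
        for layertype, weight in type_weights.items():
            if re.search(layertype, name, re.IGNORECASE):
                layer_weights[idx] = weight

    # same boundary format as PipelineModule.parts
    return ds_utils.partition_balanced(weights=layer_weights,
                                       num_parts=num_stages)
```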

stas00 avatar Nov 09 '21 05:11 stas00

Would this involve creating a PR on the upstream?

jaketae avatar Nov 20 '21 21:11 jaketae

This could be done with monkey patching first and then later added upstream.
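Roughly something like this (hypothetical names; assumes the current _partition_layers signature):

```python
# Rough sketch of the monkey-patching route. We would swap in our own
# partitioner before the model is built, so that
# partition_method='weighted_type:...' is understood, and turn it into a
# proper upstream PR once validated.
from deepspeed.runtime.pipe.module import PipelineModule


def _weighted_partition_layers(self, method='parameters'):
    # same as the original _partition_layers, plus a weighted_type:...
    # branch along the lines of the sketch above
    ...


PipelineModule._partition_layers = _weighted_partition_layers
```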

I'm just not sure we should start working on it until this issue is fixed: https://github.com/microsoft/DeepSpeed/issues/1522.

As I commented in https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/166#issuecomment-963818130, we could use BNB to compensate for ZeRO-1, but BNB has issues of its own at the moment.

Meanwhile, it was proposed to use a 150k vocab instead of 250k. I am going to see how it scales in the next few days, and then we will know whether this is required or not. I will update this issue once I have more information.

thank you.

stas00 avatar Nov 21 '21 05:11 stas00