PiPPy
Can PiPPy coexist with DeepSpeed?
Hi,
I want to know whether I can use PiPPy's PP (pipeline parallelism) capability together with DeepSpeed's ZeRO-3 config, so that the two together give 3D parallelism?
Thx
Hi @leiwen83, that's an interesting question.
I think at the ZeRO-2 stage (where the gradients are sharded), some special arrangement would be needed. As each micro-batch runs through a backward stage, its gradients need to be accumulated, so one would need to delay the reduce_scatter of gradients in ZeRO-2 and run it only once, after all micro-batches have passed through that backward stage.
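To make the ordering concrete, here is a minimal pure-Python simulation of the idea. It is only a sketch: it does not use any PiPPy or DeepSpeed API, the "ranks" are plain lists, and the names `reduce_scatter`, `delayed_reduce_scatter` and `eager_reduce_scatter` are hypothetical helpers made up for illustration. It assumes the gradient length is divisible by the number of ranks.

```python
# Pure-Python simulation of delaying reduce_scatter across micro-batches.
# No real communication happens; all names here are illustrative only.

def reduce_scatter(per_rank_grads):
    """Sum gradients element-wise across ranks, then hand rank r the
    r-th contiguous shard of the summed vector."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    assert n % world_size == 0, "gradient length must divide evenly"
    shard = n // world_size
    summed = [sum(g[i] for g in per_rank_grads) for i in range(n)]
    return [summed[r * shard:(r + 1) * shard] for r in range(world_size)]


def delayed_reduce_scatter(microbatch_grads):
    """microbatch_grads[m][r] is the full local gradient on rank r for
    micro-batch m. Accumulate locally, then run ONE reduce_scatter."""
    world_size = len(microbatch_grads[0])
    n = len(microbatch_grads[0][0])
    accumulated = [[0] * n for _ in range(world_size)]
    for grads in microbatch_grads:          # backward of each micro-batch
        for r in range(world_size):
            for i in range(n):
                accumulated[r][i] += grads[r][i]
    return reduce_scatter(accumulated)      # single collective at the end


def eager_reduce_scatter(microbatch_grads):
    """Baseline for comparison: one reduce_scatter per micro-batch,
    accumulating the resulting shards. Returns (shards, num_collectives)."""
    shards, calls = None, 0
    for grads in microbatch_grads:
        out = reduce_scatter(grads)
        calls += 1
        if shards is None:
            shards = out
        else:
            shards = [[a + b for a, b in zip(s, o)]
                      for s, o in zip(shards, out)]
    return shards, calls


# 3 micro-batches, 2 ranks, 4 gradient elements per rank.
example = [
    [[1, 2, 3, 4], [10, 20, 30, 40]],
    [[1, 1, 1, 1], [2, 2, 2, 2]],
    [[0, 1, 0, 1], [1, 0, 1, 0]],
]

if __name__ == "__main__":
    print(delayed_reduce_scatter(example))   # one collective
    print(eager_reduce_scatter(example))     # three collectives, same shards
```

In this toy setup both variants end with the same gradient shards, but the delayed version issues one collective per step instead of one per micro-batch, which is the arrangement described above.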
Cc @rohan-varma to see if you have any additional thoughts.