Fix deadlock in PipeEngine._exec_recv_grads
I'm using Megatron-DeepSpeed with TP/PP/DP. In my case there are three tensors that need to be communicated between pipeline stages:
- hidden_state (floating, needs grad)
- attention_mask (int32, no grad)
- cached_rotray_embedding (floating, no grad)
Only the first tensor has a grad, which satisfies the restriction of PipelineEngine here: https://github.com/microsoft/DeepSpeed/blob/3dd7ccff8103be60c31d963dd2278d43abb68fd1/deepspeed/runtime/pipe/engine.py#L734-L736
On the sending side, only the first tensor's grad is sent: https://github.com/microsoft/DeepSpeed/blob/3dd7ccff8103be60c31d963dd2278d43abb68fd1/deepspeed/runtime/pipe/engine.py#L1106-L1109
But the receiving stage tries to recv more than one grad, because tensor.is_floating_point() is used to filter the outputs. In my case cached_rotray_embedding is a floating-point tensor with no grad, so it is caught by the filter. The receiving stage then expects more data than was sent, which makes training hang.
https://github.com/microsoft/DeepSpeed/blob/3dd7ccff8103be60c31d963dd2278d43abb68fd1/deepspeed/runtime/pipe/engine.py#L1206-L1209
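To make the mismatch concrete, here is a minimal sketch in plain PyTorch (not the actual engine code) using the three tensors from my case:

```python
import torch

# The three tensors exchanged between pipeline stages in this setup.
outputs = (
    torch.randn(4, 8, requires_grad=True),  # hidden_state: floating, needs grad
    torch.zeros(4, 8, dtype=torch.int32),   # attention_mask: int32, no grad
    torch.randn(4, 8),                      # cached_rotray_embedding: floating, no grad
)

# Sending side (_exec_send_grads): only the first tensor's grad
# goes over the wire.
num_grads_sent = 1

# Receiving side (_exec_recv_grads): a grad buffer is allocated for
# every floating-point output, so it posts two recvs.
num_grads_expected = sum(1 for t in outputs if t.is_floating_point())

print(num_grads_sent, num_grads_expected)  # 1 vs 2 -> the extra recv never completes
```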
Since only one grad is sent anyway, we don't need the is_floating_point filter here.
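A simplified sketch of the idea behind the change (not the exact patch; filtering by requires_grad is one way to express it): allocate a grad buffer only for outputs that actually require grad, so the number of recvs matches the single grad the sender transmits.

```python
import torch

outputs = (
    torch.randn(4, 8, requires_grad=True),  # hidden_state
    torch.zeros(4, 8, dtype=torch.int32),   # attention_mask
    torch.randn(4, 8),                      # cached_rotray_embedding
)

# Buffers only for tensors that will actually receive a grad.
recv_buffers = [torch.empty_like(t) for t in outputs if t.requires_grad]
assert len(recv_buffers) == 1  # now matches what the sender actually sends
```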
Rebased onto master.
Hi @tjruwase, could you review this please?
@i4never, thanks for this PR. This is very old code, which is documented as hacky, and unfortunately the author is no longer available. Although I think your change is correct, it is best to be more cautious. So, can you please add some unit tests?
@i4never - would you be able to add any unit tests?
sure, I'm working on this.
Thanks @i4never - ping me whenever this needs review/tests/etc.