Fix deadlock in PipeEngine._exec_recv_grads
I'm using Megatron-DeepSpeed with TP/PP/DP. In my case there are three tensors that need to be communicated between pipeline stages:
- hidden_state (floating, needs grad)
- attention_mask (int32, no grad)
- cached_rotray_embedding (floating, no grad)
Only the first tensor has a grad, which satisfies the restriction of PipelineEngine here: https://github.com/microsoft/DeepSpeed/blob/3dd7ccff8103be60c31d963dd2278d43abb68fd1/deepspeed/runtime/pipe/engine.py#L734-L736
On the sending side, only the first tensor's grad is sent: https://github.com/microsoft/DeepSpeed/blob/3dd7ccff8103be60c31d963dd2278d43abb68fd1/deepspeed/runtime/pipe/engine.py#L1106-L1109
But the receiving stage tries to recv more than one grad, because tensor.is_floating_point() is used to filter the outputs. In my case cached_rotray_embedding is a floating-point tensor with no grad, so it is caught by the filter. The receiving stage then expects more data than was sent, which makes training hang.
https://github.com/microsoft/DeepSpeed/blob/3dd7ccff8103be60c31d963dd2278d43abb68fd1/deepspeed/runtime/pipe/engine.py#L1206-L1209
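To make the mismatch concrete, here is a minimal sketch in plain PyTorch (not the actual engine code) using the three tensors from my case:

```python
import torch

# The three tensors exchanged between pipeline stages in this setup.
outputs = (
    torch.randn(4, 8, requires_grad=True),  # hidden_state: floating, needs grad
    torch.zeros(4, 8, dtype=torch.int32),   # attention_mask: int32, no grad
    torch.randn(4, 8),                      # cached_rotray_embedding: floating, no grad
)

# Sending side (_exec_send_grads): only the first tensor's grad
# goes over the wire.
num_grads_sent = 1

# Receiving side (_exec_recv_grads): a grad buffer is allocated for
# every floating-point output, so it posts two recvs.
num_grads_expected = sum(1 for t in outputs if t.is_floating_point())

print(num_grads_sent, num_grads_expected)  # 1 vs 2 -> the extra recv never completes
```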
Since only one grad is sent anyway, we don't need the is_floating_point filter here.
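A simplified sketch of the idea behind the change (not the exact patch; filtering by requires_grad is one way to express it): allocate a grad buffer only for outputs that actually require grad, so the number of recvs matches the single grad the sender transmits.

```python
import torch

outputs = (
    torch.randn(4, 8, requires_grad=True),  # hidden_state
    torch.zeros(4, 8, dtype=torch.int32),   # attention_mask
    torch.randn(4, 8),                      # cached_rotray_embedding
)

# Buffers only for tensors that will actually receive a grad.
recv_buffers = [torch.empty_like(t) for t in outputs if t.requires_grad]
assert len(recv_buffers) == 1  # now matches what the sender actually sends
```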
Rebased onto master.
Hi @tjruwase, could you review this please?
@i4never, thanks for this PR. This is very old code, which is documented as hacky, and unfortunately the author is no longer available. Although I think your change is correct, it is best to be more cautious. So, can you please add some unit tests?
@i4never - would you be able to add any unit tests?
sure, I'm working on this.
Thanks @i4never - ping me whenever this needs review/tests/etc.