
[REQUEST] Add torchdynamo disable decorators to graph-break on collectives

Open • wconstab opened this issue 2 years ago • 1 comment

Regarding this issue: https://github.com/pytorch/pytorch/issues/97079

Some comm ops in DeepSpeed aren't traceable by Dynamo at the moment. Probably the best medium-term solution is to make them graph-break at their entry point instead of failing to trace somewhere deep in their internals. This can be done by adding a @torch._dynamo.disable decorator to those ops in the DeepSpeed codebase (or we could do it inside Dynamo, but I'm not sure that's the best approach).
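A minimal sketch of the proposal, assuming PyTorch 2.x. `all_reduce_entry` is a hypothetical stand-in for a DeepSpeed comm entry point (not DeepSpeed's real API), and `counting_backend` is a toy Dynamo backend that only records the graphs it receives. Wrapping the entry point in `@torch._dynamo.disable` (newer releases also expose this as `torch.compiler.disable`) makes Dynamo stop at the call instead of tracing into the collective.

```python
import torch
import torch._dynamo
import torch.distributed as dist


# Hypothetical stand-in for a DeepSpeed comm entry point; the name is
# illustrative and is not DeepSpeed's real API.
@torch._dynamo.disable
def all_reduce_entry(tensor: torch.Tensor) -> torch.Tensor:
    # Dynamo never traces inside this function: compiled code graph-breaks
    # at the call and the body runs eagerly.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor)
    # With no process group (single process), this sketch is a no-op.
    return tensor


graphs = []


def counting_backend(gm, example_inputs):
    graphs.append(gm)  # record every FX graph Dynamo captures
    return gm.forward  # execute the captured graph unchanged


def step(x: torch.Tensor) -> torch.Tensor:
    y = x * 2                # captured in the first graph
    y = all_reduce_entry(y)  # clean graph break at the entry point
    return y + 1             # captured in a second graph


compiled_step = torch.compile(step, backend=counting_backend)
out = compiled_step(torch.ones(3))
```

With the decorator in place, `counting_backend` should see two graphs: one for the ops before the call and one for the ops after it.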

Also note that 'traceable collectives' are under development in PyTorch; these could eventually be swapped into DeepSpeed so that a graph can be traced without requiring a graph break. The proposal in this issue is more of a short-term fix that makes today's inevitable graph breaks happen more cleanly and predictably for users.

wconstab • Apr 05 '23 20:04

This is resolved by https://github.com/microsoft/DeepSpeed/pull/4878.

Regarding this part of the original post: "(or, we can do it inside dynamo but I'm not sure that's the best approach)."

However, I think skipping all of DeepSpeed is still an option for torch. When execution passes from model code into DeepSpeed, we can probably assume it is some comm op (at least in the cases I've used), which needs a graph break anyway. Comm ops not covered by @torch._dynamo.disable lead to unexpected failures, e.g. https://github.com/huggingface/accelerate/pull/2460. IMO it's better/safer to add those packages to torch's skip list.
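A rough user-side approximation of "skip the whole package", assuming PyTorch 2.x. It is not torch's internal skip list. `disable_module_functions` is a hypothetical helper that wraps every public function of a module in `torch._dynamo.disable`, and `fake_comm` is a throwaway module standing in for a third-party comm package. One caveat: the patch only affects call sites that look the function up through the module at call time. Names bound earlier via `from pkg import f` keep the unwrapped function, which is one reason a skip rule inside torch could be safer.

```python
import types

import torch
import torch._dynamo


def disable_module_functions(module: types.ModuleType) -> list:
    """Hypothetical helper: wrap every public function in `module` with
    torch._dynamo.disable so Dynamo graph-breaks at each entry point."""
    wrapped = []
    for name, obj in list(vars(module).items()):
        # Leave private names and non-function attributes untouched.
        if name.startswith("_") or not isinstance(obj, types.FunctionType):
            continue
        setattr(module, name, torch._dynamo.disable(obj))
        wrapped.append(name)
    return wrapped


# A throwaway module standing in for a third-party comm package.
fake_comm = types.ModuleType("fake_comm")


def all_reduce(t):
    return t  # single-process placeholder for a real collective


def broadcast(t, src=0):
    return t  # single-process placeholder for a real collective


fake_comm.all_reduce = all_reduce
fake_comm.broadcast = broadcast
wrapped_names = disable_module_functions(fake_comm)

captured = []


def counting_backend(gm, example_inputs):
    captured.append(gm)  # record every FX graph Dynamo captures
    return gm.forward


def model_step(x):
    y = x.sin()
    # Looked up through the module at call time, so the wrapped version runs.
    y = fake_comm.all_reduce(y)
    return y.cos()


out = torch.compile(model_step, backend=counting_backend)(torch.ones(3))
```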

Similar issues also happen with Triton: https://github.com/pytorch/pytorch/issues/122768

@wconstab Does that make sense to you?

oraluben • Apr 17 '24 08:04