Howard Huang

Results: 14 issues by Howard Huang

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#91257 Collective dispatching from Process Group**

Fixes https://github.com/pytorch/pytorch/issues/90932
Fixes https://github.com/pytorch/pytorch/issues/90659

Remove redundant collective operation definitions by calling the ops directly from `ProcessGroup`.

release notes: distributed (c10d)

Add a docstring for the manual stage and an example under `basic/`. Make `input_args` a required argument.

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #1109

cla signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #1079 * #1077

cla signed

I noticed that many of my PRs still fail the checks in `./check.sh` even after running `./format.sh`. This causes the PR to fail the lint check in...

better engineering

Add a try-except around the forward pass to log the stage, shapes, etc. before re-raising the exception. Look into which debug flags can be used to handle the hang cases. Document...

better engineering
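
The try-except idea in this issue could look roughly like the sketch below. It is a minimal illustration in plain Python, not the project's implementation: `forward_with_context`, `describe_shapes`, and the logger name are hypothetical, and the code assumes only that tensor-like inputs expose a `.shape` attribute.

```python
import logging

logger = logging.getLogger("pipeline_debug")  # hypothetical logger name


def describe_shapes(args):
    """Return each argument's shape, or its type name if it has no shape."""
    return [
        tuple(a.shape) if hasattr(a, "shape") else type(a).__name__
        for a in args
    ]


def forward_with_context(stage_index, forward_fn, *args):
    """Call forward_fn(*args); on failure, log stage and shapes, then re-raise."""
    try:
        return forward_fn(*args)
    except Exception:
        # logger.exception records the message plus the active traceback.
        logger.exception(
            "forward failed on stage %d with input shapes %s",
            stage_index,
            describe_shapes(args),
        )
        # A bare `raise` re-raises the original exception unchanged.
        raise
```

Because the handler ends with a bare `raise`, callers still see the original exception type and traceback; the wrapper only adds a log record.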

We will need to update which group the batch_p2p ops are sent to and remove the current assumptions that rely on rank+1 and rank-1.

enhancement
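
One way to avoid hard-coding rank+1 and rank-1 is to look peers up in an explicit, ordered list of the ranks in the group. The sketch below is plain Python and purely illustrative: `p2p_peers` and `stage_ranks` are hypothetical names, and it does not touch any real process-group API.

```python
def p2p_peers(stage_ranks, rank):
    """Return (prev_rank, next_rank) for `rank` within an ordered rank list.

    `stage_ranks` lists the global ranks of one group in pipeline order
    (a hypothetical input). None means there is no peer on that side.
    """
    idx = stage_ranks.index(rank)  # raises ValueError if rank is not in the group
    prev_rank = stage_ranks[idx - 1] if idx > 0 else None
    next_rank = stage_ranks[idx + 1] if idx + 1 < len(stage_ranks) else None
    return prev_rank, next_rank


# Ranks in a group need not be contiguous, so rank +/- 1 would be wrong here.
print(p2p_peers([1, 3, 5, 7], 3))  # (1, 5)
```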

The loss function is currently not implemented: https://github.com/pytorch/PiPPy/blob/f2e605d045cdc64cac31e2dd99a01706eb638a16/pippy/PipelineSchedule.py#L68-L73

We should add the loss function as an argument to `PipelineSchedule.step()`. This also means that we should change the output of `forward()`: -...

enhancement
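
A toy sketch of what passing the loss function into `step()` might look like is shown below. `ToySchedule` is a hypothetical stand-in written in plain Python, not PiPPy's `PipelineSchedule`; the argument names and the choice to return per-microbatch losses are assumptions made only for illustration.

```python
class ToySchedule:
    """Hypothetical stand-in for a pipeline schedule; not PiPPy's class."""

    def __init__(self, forward_fn, n_microbatches):
        self.forward_fn = forward_fn
        self.n_microbatches = n_microbatches

    def step(self, microbatches, targets, loss_fn):
        """Run forward on each microbatch and apply loss_fn(output, target)."""
        if len(microbatches) != self.n_microbatches or len(targets) != self.n_microbatches:
            raise ValueError("expected one input and one target per microbatch")
        losses = []
        for x, y in zip(microbatches, targets):
            output = self.forward_fn(x)
            losses.append(loss_fn(output, y))
        return losses


def squared_error(output, target):
    """Simple scalar loss used only for this example."""
    return (output - target) ** 2


schedule = ToySchedule(forward_fn=lambda x: 2 * x, n_microbatches=2)
losses = schedule.step([1.0, 2.0], [2.0, 5.0], squared_error)
print(losses)  # [0.0, 1.0]
```

Taking `loss_fn` as a call-time argument keeps the schedule independent of any particular loss, at the cost of requiring targets to be supplied alongside the inputs.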

""" fwd_outputs all forced to have 'requires_grad=True' -- why? what's our design here? freqs_cis could be passed from stage0 to stage1 but is an input value from dataloader and should...

bug

Summary: ProcessGroupGloo and gloo seem to be opening and closing sockets without allowing the port to be reused. We see this issue pop up in larger training jobs "Address already...

CLA Signed
fb-exported
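
For readers unfamiliar with port reuse, the standard socket-level mechanism is the `SO_REUSEADDR` option, set before `bind()`. The Python sketch below only illustrates that option on a plain TCP listener; it is not the gloo or ProcessGroupGloo code, and `make_reusable_listener` is a hypothetical helper name.

```python
import socket


def make_reusable_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket with SO_REUSEADDR set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Must be set before bind() to affect how the local address is claimed.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))  # port=0 asks the OS for any free port
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock
```

The exact semantics of `SO_REUSEADDR` differ between operating systems, so whether this option alone addresses the failure described above is not something this sketch can confirm.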