PiPPy
Pipeline Parallelism for PyTorch
Just dumping issues here as I find them (applying PipelineStage to torchtrain). Stage 1: fwd_inputs are all forced to have `requires_grad=True` -- why? What's our design here? `freqs_cis` could be passed...
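A minimal sketch of the distinction this issue is asking for, with illustrative names (`prepare_fwd_inputs` is not PiPPy API): activations received from the previous stage need grad so backward can flow through them, while dataloader-originated inputs such as `freqs_cis` arguably should not be forced to require grad.

```python
import torch

# Hedged sketch, not PiPPy's actual code: only mark floating-point tensors
# that arrived as activations from the previous stage as requiring grad,
# and leave dataloader-provided inputs (e.g. freqs_cis) untouched.
def prepare_fwd_inputs(recv_tensors, from_prev_stage_flags):
    prepared = []
    for t, from_prev_stage in zip(recv_tensors, from_prev_stage_flags):
        if from_prev_stage and t.is_floating_point():
            # Backward needs gradients to flow through inter-stage activations.
            t = t.detach().requires_grad_(True)
        prepared.append(t)
    return prepared
```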
I noticed that many of my PRs, even after running `./format.sh`, still do not pass the checks in `./check.sh`. This causes the PR to fail the lint check in...
Add a try-except around the forward to log the stage, shapes, etc. before re-raising the exception (see the sketch below). Look into which debug flags can be used to handle the hang cases. Document...
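A minimal sketch of what that try-except could look like, assuming a stage object with `submod` and `stage_index` attributes (names are illustrative, not PiPPy's actual internals):

```python
import logging

import torch

logger = logging.getLogger(__name__)

def forward_with_logging(stage, *args):
    # Hedged sketch: wrap the stage forward so a failure reports which stage
    # raised and the shapes of its inputs before re-raising the exception.
    try:
        return stage.submod(*args)
    except Exception:
        shapes = [
            tuple(a.shape) if isinstance(a, torch.Tensor) else type(a).__name__
            for a in args
        ]
        logger.error(
            "Forward failed on stage %s with input shapes %s",
            stage.stage_index,
            shapes,
        )
        raise
```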
Will need to update which group the batch_p2p ops are sent to, and remove the current assumption that the peer stages live on rank+1 and rank-1.
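A sketch of resolving the peers through the pipeline process group instead; `torch.distributed.get_process_group_ranks` is a real API, but the surrounding function and names are illustrative:

```python
import torch.distributed as dist

def peer_global_ranks(pp_group, stage_index, num_stages):
    # Hedged sketch: map stage indices to *global* ranks through the pipeline
    # process group rather than assuming the peers are rank - 1 / rank + 1.
    group_ranks = dist.get_process_group_ranks(pp_group)
    prev_rank = group_ranks[stage_index - 1] if stage_index > 0 else None
    next_rank = group_ranks[stage_index + 1] if stage_index < num_stages - 1 else None
    return prev_rank, next_rank
```

The batched p2p ops (e.g. `dist.P2POp` passed to `dist.batch_isend_irecv`) would then be built against these resolved ranks and the pipeline group, which matters once PP is combined with TP and the global rank layout no longer matches the stage order.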
Loss function is currently not implemented: https://github.com/pytorch/PiPPy/blob/f2e605d045cdc64cac31e2dd99a01706eb638a16/pippy/PipelineSchedule.py#L68-L73 We should add the loss function as an argument to `PipelineSchedule.step()`. This also means that we should change the output of `forward()`: - ...
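A rough sketch of the proposed direction, with hypothetical signatures (the linked `PipelineSchedule` does not accept these arguments yet):

```python
class PipelineScheduleSketch:
    """Hedged sketch: accept loss_fn so the last stage can compute and
    backprop the loss per microbatch instead of leaving it unimplemented."""

    def __init__(self, stage, n_microbatches, loss_fn=None):
        self.stage = stage
        self.n_microbatches = n_microbatches
        self.loss_fn = loss_fn

    def step(self, microbatches, targets=None):
        losses = []
        for i, mb in enumerate(microbatches):
            # forward() on the last stage would need to return the model
            # output so the loss can be computed here.
            output = self.stage.forward(mb)
            if self.stage.is_last and self.loss_fn is not None:
                loss = self.loss_fn(output, targets[i])
                loss.backward()
                losses.append(loss.detach())
        return losses
```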
""" fwd_outputs all forced to have 'requires_grad=True' -- why? what's our design here? freqs_cis could be passed from stage0 to stage1 but is an input value from dataloader and should...
## Current status

Working

```
# PP = 2, TP = 4
$ torchrun --nproc-per-node 8 pippy_llama.py
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
['make', 'think', 'you', 'be', 'getting', ...
```
Hi, I get the below error whenever I try to create an optimizer. Please help.

```
optimizer = driver.instantiate_optimizer(torch.optim.Adam)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/nucleus/lib/python3.11/site-packages/torchpippy-0.1.1+8f549f3-py3.11.egg/pippy/PipelineDriver.py", line 1573, in instantiate_optimizer
    return PipelineOptimizer(
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/nucleus/lib/python3.11/site-packages/torchpippy-0.1.1+8f549f3-py3.11.egg/pippy/PipelineDriver.py", ...
```