PiPPy
Pipeline Parallelism for PyTorch
Just dumping issues here as I find them (applying PipelineStage to torchtrain). Stage 1: fwd_inputs are all forced to have `requires_grad=True` -- why? What's our design here? `freqs_cis` could be passed...
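A minimal sketch of the distinction this issue is asking for, with illustrative names (`prepare_fwd_inputs` is not PiPPy API): activations received from the previous stage need grad so backward can flow through them, while dataloader-originated inputs such as `freqs_cis` arguably should not be forced to require grad.

```python
import torch

# Hedged sketch, not PiPPy's actual code: only mark floating-point tensors
# that arrived as activations from the previous stage as requiring grad,
# and leave dataloader-provided inputs (e.g. freqs_cis) untouched.
def prepare_fwd_inputs(recv_tensors, from_prev_stage_flags):
    prepared = []
    for t, from_prev_stage in zip(recv_tensors, from_prev_stage_flags):
        if from_prev_stage and t.is_floating_point():
            # Backward needs gradients to flow through inter-stage activations.
            t = t.detach().requires_grad_(True)
        prepared.append(t)
    return prepared
```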
I noticed that many of my PRs, even after running `./format.sh`, still do not pass the checks in `./check.sh`. This causes the PR to fail the lint check in...
Add a try-except around the forward to log the stage, shapes, etc. before re-raising the exception (see the sketch below). Look into which debug flags can be used to handle the hang cases. Document...
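A minimal sketch of what that try-except could look like, assuming a stage object with `submod` and `stage_index` attributes (names are illustrative, not PiPPy's actual internals):

```python
import logging

import torch

logger = logging.getLogger(__name__)

def forward_with_logging(stage, *args):
    # Hedged sketch: wrap the stage forward so a failure reports which stage
    # raised and the shapes of its inputs before re-raising the exception.
    try:
        return stage.submod(*args)
    except Exception:
        shapes = [
            tuple(a.shape) if isinstance(a, torch.Tensor) else type(a).__name__
            for a in args
        ]
        logger.error(
            "Forward failed on stage %s with input shapes %s",
            stage.stage_index,
            shapes,
        )
        raise
```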
Will need to update which group the batch_p2p ops are sent to, and remove the current assumption that the peer stages live on rank+1 and rank-1.
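A sketch of resolving the peers through the pipeline process group instead; `torch.distributed.get_process_group_ranks` is a real API, but the surrounding function and names are illustrative:

```python
import torch.distributed as dist

def peer_global_ranks(pp_group, stage_index, num_stages):
    # Hedged sketch: map stage indices to *global* ranks through the pipeline
    # process group rather than assuming the peers are rank - 1 / rank + 1.
    group_ranks = dist.get_process_group_ranks(pp_group)
    prev_rank = group_ranks[stage_index - 1] if stage_index > 0 else None
    next_rank = group_ranks[stage_index + 1] if stage_index < num_stages - 1 else None
    return prev_rank, next_rank
```

The batched p2p ops (e.g. `dist.P2POp` passed to `dist.batch_isend_irecv`) would then be built against these resolved ranks and the pipeline group, which matters once PP is combined with TP and the global rank layout no longer matches the stage order.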
Loss function is currently not implemented: https://github.com/pytorch/PiPPy/blob/f2e605d045cdc64cac31e2dd99a01706eb638a16/pippy/PipelineSchedule.py#L68-L73 We should add the loss function as an argument to `PipelineSchedule.step()`. This also means that we should change the output of `forward()`: - ...
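A rough sketch of the proposed direction, with hypothetical signatures (the linked `PipelineSchedule` does not accept these arguments yet):

```python
class PipelineScheduleSketch:
    """Hedged sketch: accept loss_fn so the last stage can compute and
    backprop the loss per microbatch instead of leaving it unimplemented."""

    def __init__(self, stage, n_microbatches, loss_fn=None):
        self.stage = stage
        self.n_microbatches = n_microbatches
        self.loss_fn = loss_fn

    def step(self, microbatches, targets=None):
        losses = []
        for i, mb in enumerate(microbatches):
            # forward() on the last stage would need to return the model
            # output so the loss can be computed here.
            output = self.stage.forward(mb)
            if self.stage.is_last and self.loss_fn is not None:
                loss = self.loss_fn(output, targets[i])
                loss.backward()
                losses.append(loss.detach())
        return losses
```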
""" fwd_outputs all forced to have 'requires_grad=True' -- why? what's our design here? freqs_cis could be passed from stage0 to stage1 but is an input value from dataloader and should...
## Current status

Working

```
# PP = 2, TP = 4
$ torchrun --nproc-per-node 8 pippy_llama.py
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
['make', 'think', 'you', 'be', 'getting', ...
```
Hi, I get the below error whenever I try to create an optimizer. Please help.

```
optimizer = driver.instantiate_optimizer(torch.optim.Adam)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/nucleus/lib/python3.11/site-packages/torchpippy-0.1.1+8f549f3-py3.11.egg/pippy/PipelineDriver.py", line 1573, in instantiate_optimizer
    return PipelineOptimizer(
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/nucleus/lib/python3.11/site-packages/torchpippy-0.1.1+8f549f3-py3.11.egg/pippy/PipelineDriver.py", ...
```