PiPPy
Pipeline Parallelism for PyTorch
Authoring and testing sharded RNG operations needs love.
* Setting the seed via `torch.manual_seed(seed)` does not dispatch to shards.
* Constructing ops so that they'll produce the _same_ RNG choices,...
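One way to picture the "same RNG choices" requirement is the toy simulation below. It is only a sketch under loud assumptions: it uses Python's standard `random` module as a stand-in for tensor RNG, the function names `sharded_uniform` and `unsharded_uniform` are hypothetical, and it is not how DTensor implements or plans to implement sharded random ops. Each simulated shard seeds its own generator explicitly, draws the full sequence, and keeps only its slice, so the reassembled result matches the unsharded draw.

```python
import random


def unsharded_uniform(seed, total):
    """Reference result: one generator draws all `total` values."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(total)]


def sharded_uniform(seed, total, num_shards):
    """Hypothetical sharded draw (assumes total is divisible by num_shards).

    Every simulated shard seeds its own generator with the same seed,
    replays the full sequence, and keeps only its contiguous slice.
    """
    per_shard = total // num_shards
    shards = []
    for rank in range(num_shards):
        rng = random.Random(seed)  # explicit per-shard seeding
        full = [rng.random() for _ in range(total)]
        shards.append(full[rank * per_shard:(rank + 1) * per_shard])
    return shards


seed, total, num_shards = 1234, 8, 4
shards = sharded_uniform(seed, total, num_shards)
reassembled = [value for shard in shards for value in shard]
matches = reassembled == unsharded_uniform(seed, total)
```

Replaying the full sequence on every shard is wasteful; it is chosen here only because it makes the equivalence with the unsharded result easy to assert in a test.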
This depends on issue https://github.com/pytorch/pytorch/issues/85234. It's hard to debug since the abort doesn't generate a stack trace or any exception that can be caught.
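For aborts that raise no Python exception, the standard library's `faulthandler` module can sometimes recover a Python-level traceback, because it installs handlers for fatal signals including `SIGABRT`. The sketch below is a general debugging aid, not something taken from the issue; the helper name `enable_abort_tracebacks` and the use of a temporary file as the log target are assumptions for illustration, and whether it helps in this particular case is untested.

```python
import faulthandler
import tempfile


def enable_abort_tracebacks(log_file):
    """Ask faulthandler to dump the Python tracebacks of all threads to
    `log_file` when the process receives a fatal signal such as SIGABRT."""
    faulthandler.enable(file=log_file, all_threads=True)
    return faulthandler.is_enabled()


# The file must stay open for as long as faulthandler is enabled.
log_file = tempfile.TemporaryFile()
enabled = enable_abort_tracebacks(log_file)
```

In a multi-process run, each worker would need to enable this itself, ideally writing to a per-rank file so that the dumps do not interleave.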
`tanh` is part of `dtensor_lagging_op_db` but is also in `xfail()`. When I added support for `tanh`, I don't remember making any modifications other than tests. Figure out what is...
Add support for `backward()` in `test_dtensor_ops.py`, since that will cover both forward and backward (FW + BW).
As part of FSDP+TP integration, we construct tensors using `TensorInfo.from_tensor`, which calls `is_pinned`. I'm getting the following error with it:
```
File "/data/home/kumpera/repos/PiPPy/spmd/spmd/tensor/dispatch.py", line 174, in operator_dispatch
    raise RuntimeError(...
```
```
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/data/home/kw2501/repos/PiPPy/PiPPy/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/home/kw2501/repos/PiPPy/pippy/utils.py", line 107, in run_worker
    run_master(pp_ranks_per_dp_group[rank], args,...
```
1. Create a branch `hf_example_summarization`
2. Copy [summarization](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) dir to [hf](https://github.com/pytorch/PiPPy/tree/main/examples/hf)
3. Add files, commit, create a PR (`hf_example_summarization` -> `main`)
4. Create another branch `hf_example_summarization_pippy` on top of `hf_example_summarization` and all PiPPy...