PiPPy

Pipeline Parallelism for PyTorch

Results 123 PiPPy issues

I'm experimenting with the pipelined training example for ResNet (`pippy_resnet.py`) from [https://github.com/pytorch/PiPPy/tree/main/examples/resnet](https://github.com/pytorch/PiPPy/tree/main/examples/resnet). Specifically, I want to compare the loss when running locally on one GPU with the loss when running with `pippy`. I...

After a certain torch 2.2.0.dev version, the submodules formerly named submod_0, submod_1, submod_2, ... are named submod_0, submod_2, submod_4, ... As a result, this assert fails: ``` pippy/IR.py", line 682, in _number_and_count_forward_stages assert all(i in...
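The naming gap described above can be illustrated with a minimal sketch. It assumes only that stage submodules follow the `submod_<N>` pattern quoted in the issue; the helper `dense_stage_indices` is hypothetical and is not part of PiPPy. It shows why a contiguity check fails on `submod_0, submod_2, submod_4` and how the names could be mapped to dense stage indices instead.

```python
import re


def dense_stage_indices(names):
    """Map names like 'submod_4' to dense stage indices 0..n-1.

    Hypothetical helper for illustration; not a PiPPy API.
    """
    pattern = re.compile(r"^submod_(\d+)$")
    found = []
    for name in names:
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), name))
    found.sort()  # order stages by their original numeric suffix
    return {name: stage for stage, (_, name) in enumerate(found)}


# Names as reported in the issue after the torch 2.2.0.dev change.
names = ["submod_0", "submod_2", "submod_4"]

# A contiguity check on the raw suffixes fails for these names.
raw = sorted(int(n.rsplit("_", 1)[1]) for n in names)
is_contiguous = raw == list(range(len(raw)))

# Renumbering by sorted position restores a dense 0..n-1 range.
mapping = dense_stage_indices(names)
```

Here `is_contiguous` is `False`, while `mapping` assigns stages 0, 1, and 2 in order. Whether renumbering is the right fix for the actual assert is not stated in the issue.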

When I run `torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:29500 --nnodes=1 --nproc-per-node=4 test_pipeline_schedule.py --schedules gpipe`, I get the following output: ```shell [2023-12-03 08:40:53,722] torch.distributed.run: [WARNING] [2023-12-03 08:40:53,722] torch.distributed.run: [WARNING] ***************************************** [2023-12-03 08:40:53,722] torch.distributed.run: [WARNING] Setting...

Graph interpretation refers to:
- Figuring out stage module to rank mapping
- Figuring out stage-to-stage communication relationship (connection, tensor transmission size, etc)

Pipeline executor refers to:
- Running micro-chunked...
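The two responsibilities listed above can be sketched in plain Python. This is an illustration under stated assumptions, not PiPPy's implementation: the round-robin placement, the linear stage chain, and all three function names are hypothetical, and "micro-chunked" is read here as splitting one batch into contiguous microbatches.

```python
def map_stages_to_ranks(num_stages, world_size):
    """Graph interpretation: place stage i on rank i % world_size (assumed policy)."""
    return {stage: stage % world_size for stage in range(num_stages)}


def stage_edges(num_stages):
    """Graph interpretation: communication edges, assuming a linear chain of stages."""
    return [(stage, stage + 1) for stage in range(num_stages - 1)]


def split_into_microbatches(batch, num_chunks):
    """Executor side: split a batch into num_chunks nearly equal contiguous chunks."""
    base, extra = divmod(len(batch), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(batch[start:start + size])
        start += size
    return chunks


placement = map_stages_to_ranks(num_stages=4, world_size=4)
edges = stage_edges(num_stages=4)
microbatches = split_into_microbatches(list(range(10)), num_chunks=4)
```

Keeping the mapping and edge computation separate from the chunked execution mirrors the split the issue describes: the first two functions need only the graph, while the third needs only the input batch.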

Using the latest nightly (1109) on an H100 server, running `tests/local_test_c10d.py` fails the final tensor comparison with a 16% mismatch (appears to be rounding; the largest diff is .0097). ~~~...
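To show how a rounding-sized difference can fail a strict comparison yet pass a looser one, here is a small pure-Python sketch. It uses the element-wise criterion `|actual - expected| <= atol + rtol * |expected|`, the same form used by `torch.allclose`. The helper `mismatch_report` and the sample values are invented for illustration; they are not taken from the test in the issue.

```python
def mismatch_report(actual, expected, rtol=1e-5, atol=1e-8):
    """Return (fraction of mismatched elements, largest absolute difference)."""
    diffs = [abs(a - e) for a, e in zip(actual, expected)]
    mismatched = sum(
        1 for d, e in zip(diffs, expected) if d > atol + rtol * abs(e)
    )
    return mismatched / len(diffs), max(diffs)


# Illustrative data only: one element is off by roughly 0.0097.
expected = [1.0, 2.0, 3.0, 4.0]
actual = [1.0, 2.0097, 3.0, 4.0]

# Strict default tolerances flag the element as a mismatch.
strict_fraction, max_diff = mismatch_report(actual, expected)

# A looser absolute tolerance accepts the same difference.
loose_fraction, _ = mismatch_report(actual, expected, atol=1e-2)
```

Whether a looser tolerance is appropriate for the H100 run depends on the numeric precision in use, which the issue excerpt does not specify.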

I'm trying to fine-tune a language model while freezing some of the first layers of RoBERTa. The code is very similar to the `run_mlm.py` example from [https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling](https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling). But I get an error in the step of...

I'm trying to run a model based on **RoBERTa**, modeled on the `run_mlm.py` example from [https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling](https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling). But when using the function `split_into_equal_size`, I get a submodule without any layers. This can...

I played with the [`hf_generate`](https://github.com/pytorch/tau/pull/772) branch and it seems quite ready to be expanded to support BLOOM-3B/7B1 models etc. (https://github.com/zsc/tau/pull/1). Great work! Is there an imminent plan to...

In my understanding, pipeline parallelism is decentralized, so why is a master needed in the example? `args.world_size = 5 # "This program requires exactly 4 workers + 1 master"`

root@6496cf66be1e:/workspace/PiPPy/examples/resnet# python pippy_resnet.py -s=1F1B [PiPPy] World size: 5, DP group size: 1, PP group size: 5 rank = 4 host/pid/device = 6496cf66be1e/2823/cuda:4 [W socket.cpp:601] [c10d] The client socket has failed...