PiPPy
Pipeline Parallelism for PyTorch
I'm experimenting with the pipelined ResNet training example (`pippy_resnet.py`) from [https://github.com/pytorch/PiPPy/tree/main/examples/resnet](https://github.com/pytorch/PiPPy/tree/main/examples/resnet). Specifically, I want to compare the loss when running locally on one GPU against the loss when running with `pippy`. I...
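For what it's worth, a minimal single-GPU reference loop like the sketch below is one way to get a baseline loss to compare against the pipelined run. This is not the example's own code: the model, seed, and dummy batch are assumptions, and the real dataset wiring is elided.

```python
# Minimal single-GPU reference run (a sketch, not from pippy_resnet.py).
# Fixing the seed and using the same batch order is assumed necessary
# before the local loss can be compared against the pipelined loss.
import torch
import torch.nn as nn
import torchvision

torch.manual_seed(0)

model = torchvision.models.resnet18(num_classes=10).cuda()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Dummy batch; in practice this would come from the same seeded data
# loader that the pipelined run uses.
x = torch.randn(32, 3, 32, 32, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

for step in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.6f}")  # compare against the pippy run
```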
After a certain torch 2.2.0.dev version, the split submodules are named submod_0, submod_2, submod_4, ... instead of submod_0, submod_1, submod_2, ..., so this assert fails: ``` pippy/IR.py", line 682, in _number_and_count_forward_stages assert all(i in... ```
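Until the assert is relaxed upstream, one workaround is to number stages by the sorted order of the `submod_<i>` suffix rather than asserting that the suffixes are contiguous. A sketch, assuming stage names always match `submod_<int>` (the helper name is hypothetical, not PiPPy's):

```python
# A tolerant renumbering pass: the submodule suffixes may be
# non-contiguous (0, 2, 4, ...), so instead of asserting that every
# index 0..N-1 is present, sort by suffix and assign consecutive ids.
import re

def number_forward_stages(submod_names):
    pat = re.compile(r"^submod_(\d+)$")
    indexed = sorted(
        (int(m.group(1)), name)
        for name in submod_names
        if (m := pat.match(name))
    )
    # stage id = position in sorted order, not the raw suffix
    return {name: stage for stage, (_, name) in enumerate(indexed)}

print(number_forward_stages(["submod_0", "submod_2", "submod_4"]))
# {'submod_0': 0, 'submod_2': 1, 'submod_4': 2}
```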
When I run `torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:29500 --nnodes=1 --nproc-per-node=4 test_pipeline_schedule.py --schedules gpipe`, I get the following output: ```shell [2023-12-03 08:40:53,722] torch.distributed.run: [WARNING] [2023-12-03 08:40:53,722] torch.distributed.run: [WARNING] ***************************************** [2023-12-03 08:40:53,722] torch.distributed.run: [WARNING] Setting... ```
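For context, a script launched this way is expected to read the environment variables torchrun sets (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, plus the rendezvous-derived `MASTER_ADDR`/`MASTER_PORT`) and initialize the process group itself. A minimal sketch of that boilerplate (not the actual test file):

```python
# Boilerplate a torchrun-launched script usually runs, assuming the
# standard env vars torchrun sets.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT come from torchrun

print(f"rank {rank}/{world_size} on cuda:{local_rank} initialized")
dist.destroy_process_group()
```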
Graph interpretation refers to:
- Figuring out the stage-module-to-rank mapping
- Figuring out the stage-to-stage communication relationships (connections, tensor transmission sizes, etc.)

Pipeline executor refers to:
- Running micro-chunked...
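A sketch of the kind of artifacts graph interpretation might hand to the executor; all names here are hypothetical, not PiPPy's actual data structures:

```python
# Hypothetical outputs of "graph interpretation": a stage-to-rank
# placement plus per-edge communication metadata that the pipeline
# executor can then act on.
from dataclasses import dataclass

@dataclass
class CommEdge:
    src_stage: int
    dst_stage: int
    tensor_shape: tuple  # transmission size per micro-batch
    dtype: str

# Stage placement: stage i runs on rank stage_to_rank[i].
stage_to_rank = {0: 0, 1: 1, 2: 2, 3: 3}

# Stage-to-stage connections discovered from the split graph.
edges = [
    CommEdge(0, 1, (8, 512), "float32"),
    CommEdge(1, 2, (8, 512), "float32"),
    CommEdge(2, 3, (8, 512), "float32"),
]

for e in edges:
    print(f"stage {e.src_stage} (rank {stage_to_rank[e.src_stage]}) -> "
          f"stage {e.dst_stage} (rank {stage_to_rank[e.dst_stage]}): "
          f"{e.tensor_shape} {e.dtype}")
```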
Using the latest nightly (1109) and running on an H100 server: running tests/local_test_c10d.py results in the final tensor comparison failing with a 16% mismatch (it appears to be rounding; the largest diff is 0.0097). ~~~...
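If the drift is TF32-related (TF32 matmuls are enabled by default on Ampere/Hopper GPUs such as H100), a sketch of two ways to confirm or absorb it; the tolerance values are assumptions sized to the ~1e-2 diff reported above:

```python
# Two common ways to handle small numeric drift from TF32 matmuls.
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

# Option 1: disable TF32 so the matmul runs in full fp32.
torch.backends.cuda.matmul.allow_tf32 = False
ref = a @ b

# Option 2: keep TF32 but compare with relaxed tolerances.
torch.backends.cuda.matmul.allow_tf32 = True
fast = a @ b

print((fast - ref).abs().max())  # expect ~1e-2-scale differences from TF32
torch.testing.assert_close(fast, ref, rtol=1e-2, atol=0.1)
```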
I'm trying to fine-tune for language modeling, freezing the first few layers of RoBERTa. The code is pretty similar to the `run_mlm.py` example from [https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling](https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling). But I get an error at the step of...
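For reference, the freezing itself is straightforward given the standard Hugging Face transformers module layout; a sketch (the layer count `N_FROZEN` is arbitrary, and this says nothing about how the frozen parameters interact with the elided failing step):

```python
# Freeze the embeddings and the first N encoder layers of RoBERTa.
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")

N_FROZEN = 4
for p in model.roberta.embeddings.parameters():
    p.requires_grad = False
for layer in model.roberta.encoder.layer[:N_FROZEN]:
    for p in layer.parameters():
        p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")
```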
I'm trying to run a model based on **RoBERTa**, by analogy with the `run_mlm.py` example from [https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling](https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling). But when using the function `split_into_equal_size`, I get a submodule without any layers. This can...
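As a toy illustration (not PiPPy's actual splitting code, and with made-up parameter counts) of how a size-balanced split can produce an empty stage when one module dominates the parameter budget:

```python
# A greedy partition by parameter count: if one block (e.g. the
# embeddings) dominates the total, a bucket can be closed before any
# transformer layer lands in it, and the last stage ends up empty.
param_counts = {
    "embeddings": 50_000_000,
    "layer_0": 7_000_000,
    "layer_1": 7_000_000,
    "layer_2": 7_000_000,
    "layer_3": 7_000_000,
    "lm_head": 39_000_000,
}
n_stages = 4
target = sum(param_counts.values()) / n_stages

stages, current, acc = [], [], 0
for name, n in param_counts.items():
    current.append(name)
    acc += n
    if acc >= target and len(stages) < n_stages - 1:
        stages.append(current)
        current, acc = [], 0
stages.append(current)
print(stages)  # note the empty final stage
```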
I played with the [`hf_generate`](https://github.com/pytorch/tau/pull/772) branch, and it seems quite ready to be expanded to support BLOOM-3B/7B1 models, etc. (https://github.com/zsc/tau/pull/1). Great work! Is there an imminent plan to...
In my understanding, pipeline parallelism is decentralized, so why is a master needed in the example? `args.world_size = 5  # "This program requires exactly 4 workers + 1 master"`
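For background, the older RPC-based PiPPy examples used a driver/worker layout: one extra rank traces the model and drives the schedule, while the worker ranks host pipeline stages and serve RPCs. A sketch of that pattern using plain `torch.distributed.rpc`; which rank is designated master (the last one here) and the driver logic itself are assumptions:

```python
# Driver/worker layout: N worker ranks host stages, one extra rank drives.
import os
import torch.distributed.rpc as rpc

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])  # e.g. 4 workers + 1 master = 5

rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

if rank == world_size - 1:  # which rank acts as master varies by example
    # Master: build the pipeline, place stages on the worker ranks via
    # RPC, then feed mini-batches into the pipeline driver.
    pass  # driver logic elided in this sketch

# Workers block here serving RPCs until the master is done.
rpc.shutdown()
```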
```
root@6496cf66be1e:/workspace/PiPPy/examples/resnet# python pippy_resnet.py -s=1F1B
[PiPPy] World size: 5, DP group size: 1, PP group size: 5
rank = 4 host/pid/device = 6496cf66be1e/2823/cuda:4
[W socket.cpp:601] [c10d] The client socket has failed...
```