Ke Wen

Results: 36 issues by Ke Wen

After a certain torch 2.2.0.dev version, the submodules previously named submod_0, submod_1, submod_2 ... are named submod_0, submod_2, submod_4 ... Hence this assert fails: ``` pippy/IR.py", line 682, in _number_and_count_forward_stages assert all(i in...
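The naming gap described above can be illustrated with a minimal sketch. It assumes the only problem is that the numeric suffixes are no longer contiguous; `renumber_submods` is a hypothetical helper written for illustration and is not the fix adopted in PiPPy.

```python
import re


def renumber_submods(names, prefix="submod_"):
    """Map sparsely numbered names (submod_0, submod_2, ...) to 0..N-1.

    Hypothetical helper for illustration only, not part of PiPPy.
    Names that do not match the prefix pattern are ignored.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    indexed = []
    for name in names:
        match = pattern.match(name)
        if match:
            indexed.append((int(match.group(1)), name))
    # Sort by the original numeric suffix so relative order is preserved.
    indexed.sort()
    return {old: f"{prefix}{new}" for new, (_, old) in enumerate(indexed)}


mapping = renumber_submods(["submod_0", "submod_2", "submod_4"])
```

With the input above, `submod_2` maps to `submod_1` and `submod_4` maps to `submod_2`, so a check that expects every index in `range(N)` to be present would pass again.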

Graph interpretation refers to: - Figuring out the stage-module-to-rank mapping - Figuring out the stage-to-stage communication relationship (connection, tensor transmission size, etc.) Pipeline executor refers to: - Running micro-chunked...

CI failure caused by HF changes. ``` test/hf_test.py:637: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _...

``` -- Process 0 terminated with the following error: Traceback (most recent call last): File "/data/home/kw2501/repos/PiPPy/PiPPy/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap fn(i, *args) File "/data/home/kw2501/repos/PiPPy/pippy/utils.py", line 107, in run_worker run_master(pp_ranks_per_dp_group[rank], args,...

enhancement

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #125975 * __->__ #125729 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab...

oncall: distributed
ciflow/trunk
release notes: distributed (pipeline)
merging
ci-td-distributed

1. Add pipeline schedules: - GPipe - 1F1B - Interleaved 1F1B - LoopedBFS 2. Add basic forward and backward tests: test_schedule.py Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #125975...

oncall: distributed
release notes: distributed (pipeline)
ci-td-distributed
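To make the difference between two of the schedules listed above concrete, here is a minimal sketch of the per-stage action order for GPipe and 1F1B. It is a textbook-style illustration under stated assumptions (one stage per rank, backward passes issued in microbatch order); the function names are hypothetical and this is not the code added in the PR.

```python
def gpipe_order(num_microbatches):
    """GPipe: run every forward microbatch, then every backward microbatch."""
    forwards = [("F", i) for i in range(num_microbatches)]
    backwards = [("B", i) for i in range(num_microbatches)]
    return forwards + backwards


def one_f_one_b_order(stage, num_stages, num_microbatches):
    """1F1B: warm up with forwards, then alternate one forward, one backward.

    Earlier stages need more warmup forwards because their first backward
    arrives later. Hypothetical helper for illustration only.
    """
    warmup = min(num_stages - stage - 1, num_microbatches)
    order = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: each remaining forward is paired with one backward.
    while fwd < num_microbatches:
        order.append(("F", fwd))
        fwd += 1
        order.append(("B", bwd))
        bwd += 1
    # Cooldown: drain the backwards left over from the warmup phase.
    while bwd < num_microbatches:
        order.append(("B", bwd))
        bwd += 1
    return order


last_stage = one_f_one_b_order(stage=3, num_stages=4, num_microbatches=4)
first_stage = one_f_one_b_order(stage=0, num_stages=2, num_microbatches=4)
```

In this sketch the last stage strictly alternates forward and backward, while earlier stages hold a few microbatches in flight; this is why 1F1B bounds the activation memory that GPipe accumulates across all microbatches.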

### 🚀 The feature, motivation and pitch When running tracer mode with torchtitan, the following `NotImplementedError` was raised: ``` Parameter freqs_cis used in multiple stages: {submod_0: None, submod_1: None}. Currently,...
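The condition behind this error can be sketched as a simple check over a stage-to-parameter mapping. This is an illustrative sketch only: `find_shared_params` is a hypothetical helper, and every parameter name other than `freqs_cis` is made up for the example; it does not reproduce the tracer's actual logic.

```python
from collections import defaultdict


def find_shared_params(stage_params):
    """Return parameters referenced by more than one stage.

    `stage_params` maps a stage name to the parameter names it uses.
    Hypothetical helper for illustration only.
    """
    users = defaultdict(list)
    for stage, params in stage_params.items():
        for param in params:
            users[param].append(stage)
    return {p: stages for p, stages in users.items() if len(stages) > 1}


shared = find_shared_params(
    {
        # "tok_embeddings.weight" and "output.weight" are assumed names.
        "submod_0": ["tok_embeddings.weight", "freqs_cis"],
        "submod_1": ["output.weight", "freqs_cis"],
    }
)
```

Here `shared` contains only `freqs_cis`, mapped to both stages, which is the situation the error message reports.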

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #126735 * #126721 * #126812 Added a manual stage in test_schedule.py so that we can test various schedules against it. In this file...

oncall: distributed
ciflow/trunk
topic: not user facing
merging

### 🐛 Describe the bug The following Llama2 program used to work, but recently started failing. ### Error logs ``` Traceback (most recent call last): File "/data/users/kw2501/pytorch/torch/distributed/pipelining/_IR.py", line 1005, in _trace_with_export...

high priority
triage review
oncall: pt2
oncall: export

Models can be big. Therefore we would need to: - create the model's "skeleton" on meta device - partition it so that it can fit on each device, and -...

cla signed
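The partitioning step mentioned in the last issue ("partition it so that it can fit on each device") can be sketched as a greedy split of consecutive layers under a per-device budget. This is a minimal sketch under assumptions: layer sizes and the capacity are abstract units, `partition_layers` is a hypothetical helper, and the issue itself does not say which partitioning strategy is intended.

```python
def partition_layers(layer_sizes, capacity):
    """Greedily group consecutive layers so each group fits in `capacity`.

    Returns a list of stages, each a list of layer indices.
    Hypothetical helper for illustration only.
    """
    stages, current, used = [], [], 0
    for idx, size in enumerate(layer_sizes):
        if size > capacity:
            raise ValueError(f"layer {idx} (size {size}) exceeds capacity {capacity}")
        # Start a new stage when the next layer would overflow this one.
        if current and used + size > capacity:
            stages.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        stages.append(current)
    return stages


stages = partition_layers([4, 4, 3, 5, 2], capacity=8)
```

Keeping layers consecutive preserves the forward order of the model, so each resulting group can serve as one pipeline stage.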