Ke Wen

Results 36 issues of Ke Wen

Dear developers, could you please help with the following errors? Thank you! ``` $ git clone https://github.com/google/nccl-fastsocket.git $ cd nccl-fastsocket $ bazel build :all WARNING: Output base '/home/user/.cache/bazel/_bazel_user/1340a46a9e7502c5cf03e1a0a087e4f3' is...

The purpose of this PR is to show: 1. Only a one-line change is needed -- remove this line: ``` self.freqs_cis = self.freqs_cis.to(h.device) ``` Reason 1: compile does not support in-place attribute mutation....

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #125449 * #125448 * #125273 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj...

oncall: distributed
topic: not user facing
ci-td-distributed

`torch.export` has a strict mode and a non-strict mode. For the difference between them, see [Non-Strict Export](https://pytorch.org/docs/stable/export.html#non-strict-export). This PR switches to non-strict mode by default, which improves the tracing success rate (no Dynamo graph breaks).

cla signed

Currently every test defines its own example model. We should have a model registry to deduplicate those models, and the tests just fetch from it.
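One possible shape for such a registry is sketched below. All names here (`register_model`, `get_model`, `ToyMLP`) are hypothetical and not taken from the repository, and the stand-in model is plain Python so the sketch runs without PyTorch; a real entry would presumably be a `torch.nn.Module`.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a model name to a zero-argument factory.
_MODEL_REGISTRY: Dict[str, Callable[[], object]] = {}


def register_model(name: str):
    """Decorator that records a model factory (or class) under `name`."""

    def decorator(factory):
        if name in _MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        _MODEL_REGISTRY[name] = factory
        return factory

    return decorator


def get_model(name: str):
    """Build a fresh instance of the registered model, so tests do not share state."""
    try:
        factory = _MODEL_REGISTRY[name]
    except KeyError:
        available = sorted(_MODEL_REGISTRY)
        raise KeyError(f"unknown model {name!r}; available: {available}") from None
    return factory()


@register_model("toy_mlp")
class ToyMLP:
    """Stand-in for an example model; doubles every input element."""

    def __call__(self, x):
        return [v * 2 for v in x]
```

A test would then call `get_model("toy_mlp")` instead of defining its own model class.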

Test case: ``` torchrun --nproc-per-node 4 test_fwd.py ``` Reason: ![Screenshot 2024-03-14 at 12 05 52 PM](https://github.com/pytorch/PiPPy/assets/6676466/83cb05ba-3cbc-4276-9275-d868e0c22be9) When stage 0 finishes computation and hits batch_send, all corresponding comms from other ranks...

high-pri

![Screenshot 2024-03-06 at 11 54 48 AM](https://github.com/pytorch/PiPPy/assets/6676466/77462940-7696-41b6-b8a2-e47b7a444dad) We need to investigate whether this is a test issue, a PiPPy issue, or a general PyTorch issue.

## Current status Working ``` # PP = 2, TP = 4 $ torchrun --nproc-per-node 8 pippy_llama.py ['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right'] ['make', 'think', 'you', 'be', 'getting',...

cla signed