Ke Wen issues

Results: 36 issues of Ke Wen

[WIP] Use DTensor-based tensor parallel

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #180 Status: - Switched to DTensor-based TP in the regular tensor path - Result is correct, but there is a perf gap...

CLA Signed

Add PP tracer + DP test

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #379 * #362 * #381 Separates the addition of the 2D test from the original PR #362 for easier review and landing. Also changed...

CLA Signed

Register buffer init callbacks in llama

# What does this PR do? Non-persistent buffers are not saved in the state dict. In the case of meta init, while loading the state dict from a checkpoint can fill in parameters...

[WIP] Initial add of distributed model

Added files: - model_dist.py: a mirror of model.py with Tensor Parallelism baked in. - dist_run.py: a toy example of how to run the model in a distributed way. Test: ``` torchrun --nproc-per-node...

CLA Signed

[Not for land] Util for saving quantized model

When composing distributed with quantization, one potential case is that the model has already been quantized and saved, so a second run does not need to quantize it again. This is...

CLA Signed

[Distributed] Support loading from single checkpoint binary

### 🚀 The feature, motivation and pitch This is for aligning the distributed load behavior with the single-device case. Today distributed relies on an index file containing a `param->bin` mapping to limit...
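As a rough illustration of what a `param->bin` index can look like, here is a minimal sketch that assumes a JSON file mapping each parameter name to the binary file that stores it. The `weight_map` key, the parameter and file names, and the `files_needed` helper are all hypothetical and not taken from the issue; the snippet only shows how such a mapping lets a loader work out which files hold a given set of parameters.

```python
import json
from collections import defaultdict

# Hypothetical index contents: parameter name -> checkpoint binary file.
# The key name, parameter names, and file names are made up for illustration.
INDEX_JSON = """
{
  "weight_map": {
    "layers.0.attention.wq.weight": "model-00001.bin",
    "layers.0.attention.wk.weight": "model-00001.bin",
    "layers.1.attention.wq.weight": "model-00002.bin",
    "output.weight": "model-00003.bin"
  }
}
"""


def files_needed(weight_map, wanted_params):
    """Group the wanted parameter names by the binary file that stores them."""
    by_file = defaultdict(list)
    for name in wanted_params:
        by_file[weight_map[name]].append(name)
    return dict(by_file)


weight_map = json.loads(INDEX_JSON)["weight_map"]

# Example: a worker that only holds layer 0 looks up just that layer's files.
layer0_params = [name for name in weight_map if name.startswith("layers.0.")]
plan = files_needed(weight_map, layer0_params)
print(plan)
```

With a single checkpoint binary there is no such index to consult, which is the case the issue title asks to support.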

[Distributed] Did not find tokenizer at {tokenizer_path}

### 🐛 Describe the bug ``` torchrun --nproc-per-node 8 dist_run.py ``` ``` known configs: ['13B', '30B', '34B', '70B', '7B', 'CodeLlama-7b-Python-hf', 'Mistral-7B', 'stories110M', 'stories15M', 'stories42M', 'Meta-Llama-3-70B', 'Meta-Llama-3-8B', 'Meta-Llama-3.1-70B-Tune', 'Meta-Llama-3.1-70B', 'Meta-Llama-3.1-8B-Tune', 'Meta-Llama-3.1-8B']...

Distributed

[Distributed] Fix extra context on device 0

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #137763 * __->__ #135273 * #137161 * #138178 This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279: ## First part: Moves the GPU guard...

oncall: distributed

ciflow/trunk

release notes: distributed (c10d)

topic: bug fixes

ciflow/periodic

ciflow/inductor

keep-going

Upgrade distributed test to g4dn instances (T4 GPUs)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #137763 * #135273 * __->__ #137161 * #138178 cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

oncall: distributed

ciflow/trunk

topic: not user facing

ciflow/periodic

ciflow/inductor

keep-going

ciflow/rocm

ci-no-td

DISABLED test_get_future_result (main.NcclErrorHandlingTest)

This test was disabled because it is failing on main branch ([recent examples](https://torch-ci.com/failure?failureCaptures=%5B%22distributed%2Ftest_c10d_nccl.py%3A%3ANcclErrorHandlingTest%3A%3Atest_get_future_result%22%5D)). cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

oncall: distributed

skipped