Ke Wen
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #180 Status: - Switched to DTensor-based TP in the regular tensor path - Result is correct, but there is a perf gap...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #379 * #362 * #381 Separate the addition of 2D test from original PR #362 for easier review and landing. Also changed...
# What does this PR do? Non-persistent buffers are not saved in the state dict. In the case of meta init, while loading the state dict from a checkpoint can fill in parameters...
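To illustrate the state-dict behavior described above, here is a minimal sketch using a hypothetical `Toy` module (the module and its buffer names are assumptions for illustration, not taken from the PR). It relies on `register_buffer(..., persistent=False)`, which keeps a buffer on the module but leaves it out of `state_dict()`.

```python
import torch
import torch.nn as nn


class Toy(nn.Module):
    """Hypothetical module with one persistent and one non-persistent buffer."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # Persistent buffer: included in state_dict().
        self.register_buffer("scale", torch.ones(4))
        # Non-persistent buffer: kept on the module, excluded from state_dict().
        self.register_buffer("cache", torch.zeros(4), persistent=False)


model = Toy()
state_keys = set(model.state_dict().keys())
buffer_names = {name for name, _ in model.named_buffers()}
```

Because `cache` never reaches the checkpoint, loading a state dict cannot restore it, so code that builds the model on the meta device would presumably need to initialize such buffers some other way.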
Added files: - model_dist.py: a mirror of model.py with Tensor Parallelism baked in. - dist_run.py: a toy example of how to run the model in a distributed way. Test: ``` torchrun --nproc-per-node...
When composing distributed with quantization, one potential case is that the model has already been quantized and saved, so a second run does not need to quantize it again. This is...
### 🚀 The feature, motivation and pitch This is for aligning distributed's load behavior with the single-device case. Today distributed relies on an index file containing a `param->bin` mapping to limit...
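As a rough sketch of how a `param->bin` mapping can be used, the snippet below groups a set of wanted parameter names by the checkpoint file that holds them. The excerpt above is truncated, so what exactly the index limits is not stated; this sketch assumes it narrows which files need to be opened. The helper `files_needed`, the parameter names, and the file names are all hypothetical.

```python
from collections import defaultdict


def files_needed(weight_map, wanted_params):
    """Group the wanted parameter names by the checkpoint file that stores them."""
    by_file = defaultdict(list)
    for name in wanted_params:
        by_file[weight_map[name]].append(name)
    return dict(by_file)


# Hypothetical index contents: parameter name -> checkpoint file.
weight_map = {
    "tok_embeddings.weight": "model-00001.bin",
    "layers.0.attention.wq.weight": "model-00001.bin",
    "layers.1.attention.wq.weight": "model-00002.bin",
    "output.weight": "model-00002.bin",
}

plan = files_needed(
    weight_map,
    ["layers.0.attention.wq.weight", "tok_embeddings.weight"],
)
```

In this example only `model-00001.bin` appears in `plan`, so a loader following it would never touch the second file.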
### 🐛 Describe the bug ``` torchrun --nproc-per-node 8 dist_run.py ``` ``` known configs: ['13B', '30B', '34B', '70B', '7B', 'CodeLlama-7b-Python-hf', 'Mistral-7B', 'stories110M', 'stories15M', 'stories42M', 'Meta-Llama-3-70B', 'Meta-Llama-3-8B', 'Meta-Llama-3.1-70B-Tune', 'Meta-Llama-3.1-70B', 'Meta-Llama-3.1-8B-Tune', 'Meta-Llama-3.1-8B']...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #137763 * __->__ #135273 * #137161 * #138178 This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279: ## First part: Moves the GPU guard...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #137763 * #135273 * __->__ #137161 * #138178 cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
This test was disabled because it is failing on the main branch ([recent examples](https://torch-ci.com/failure?failureCaptures=%5B%22distributed%2Ftest_c10d_nccl.py%3A%3ANcclErrorHandlingTest%3A%3Atest_get_future_result%22%5D)). cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o