Chien-Chin Huang
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #125339 * #125338 * __->__ #125337 * #125336 * #125335 * #125334 * #125333 Summary: Fixes #122792 state_dict includes only persistent buffers, while...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #125339 * #125338 * #125337 * #125336 * #125335 * #125334 * __->__ #125333 Summary: 1. Avoid using `torch._dynamo.disable`. 2. Clear the LRU...
Summary: The profiler currently maintains a counter locally and that counter is not synchronized with the checkpointed train step. This PR fixes the issue.
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #302 Summary: This PR implements two different async checkpointing approaches. The first uses DCP.async_save; the other uses pinned...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #319
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #360 and optimizers
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #518 This PR enables HSDP. **Discussions** **1. How does trainer get DP mesh?** Right now, we flatten `["dp_replicate", "dp_shard"]` into a flattened...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #433 This PR adds experimental flags and functions to enable context parallelism. We currently support only FSDP + CP and CP...
**Why do we need this?** There have been many requests to get HF checkpoints working with TorchTitan. Workarounds for this problem already exist. However, the converted...