Chien-Chin Huang

Results: 28 issues by Chien-Chin Huang

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #125339 * #125338 * __->__ #125337 * #125336 * #125335 * #125334 * #125333 Summary: Fixes #122792. `state_dict` includes only persistent buffers, while...

oncall: distributed
ciflow/trunk
ciflow/periodic
module: distributed_checkpoint

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #125339 * #125338 * #125337 * #125336 * #125335 * #125334 * __->__ #125333 Summary: 1. Avoid using `torch._dynamo.disable`. 2. Clear the LRU...

oncall: distributed
ciflow/trunk
release notes: distributed (c10d)
module: dynamo
ciflow/inductor

Summary: The profiler currently maintains a counter locally and that counter is not synchronized with the checkpointed train step. This PR fixes the issue.

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #302 Summary: This PR implements two different async checkpointing approaches. The first uses `DCP.async_save`; the other uses pinned...

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #319

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #360 and optimizers

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #518 This PR enables HSDP. **Discussions** **1. How does trainer get DP mesh?** Right now, we flatten `["dp_replicate", "dp_shard"]` into a flattened...

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #433 This PR adds experimental flags and functions to enable context parallelism. We currently support only FSDP + CP and CP...

CLA Signed

**Why do we need this?** There have been many requests to get HF checkpoints to work with TorchTitan. There are already workarounds for this problem. However, the converted...

CLA Signed

Add TorchFT integration test

CLA Signed