Junjie Wang
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #96989 Differential Revision: [D44158327](https://our.internmc.facebook.com/intern/diff/D44158327)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #96985 * #96989 Differential Revision: [D44158326](https://our.internmc.facebook.com/intern/diff/D44158326)
As part of the ShardedTensor deprecation, we are starting to clean up its use in torchsnapshot. This is the first PR in a series, and we want to get feedback...
In this PR, we mostly measured the performance and loss curves for the 405B model with some optimization techniques we recently developed. We also want to log the actual peak TFLOPs...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #134528 * #134383 - This PR generates a more useful output log for users: P1552399180. - It also fixes the logic when...
This is the first step toward including more models in torchtitan to demonstrate the composability of pretraining. Llama 3.2 is coming, and we already have it available in torchtune. We...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #140975 We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727, a feature that tries to reuse CudaEvent so that we don't call destroy...