torchtitan
Implement async_checkpoint
Stack from ghstack (oldest at bottom):
- -> #302
Summary: This PR implements two different approaches to async checkpointing. The first uses DCP.async_save; the second uses pinned memory plus a separate process to avoid GIL contention.
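The second approach can be illustrated with a minimal, stdlib-only sketch. It is not the code in this PR: `BackgroundCheckpointer` and `_writer_loop` are hypothetical names, a `copy.deepcopy` of a plain dict stands in for staging tensors into pinned memory, and `pickle` stands in for the real checkpoint format. It also assumes a POSIX system where the `fork` start method is available. The idea it shows is that the trainer only pays for a quick snapshot, while the file write happens in a separate process with its own GIL.

```python
import copy
import multiprocessing as mp
import os
import pickle
import tempfile


def _writer_loop(queue):
    """Runs in the child process: write each staged snapshot until told to stop."""
    while True:
        item = queue.get()
        if item is None:  # sentinel: shut down
            break
        path, snapshot = item
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Atomic rename so a reader never sees a partially written file.
        os.replace(tmp_path, path)


class BackgroundCheckpointer:
    """Hypothetical helper: stage state in the trainer, write it in another process."""

    def __init__(self):
        # Assumption: "fork" is available (POSIX only). A real CUDA training
        # job would more likely need "spawn".
        ctx = mp.get_context("fork")
        self._queue = ctx.Queue()
        self._proc = ctx.Process(
            target=_writer_loop, args=(self._queue,), daemon=True
        )
        self._proc.start()

    def save(self, state, path):
        # Staging step: snapshot the state so later training updates cannot
        # change what gets written. (Stand-in for a copy into pinned memory.)
        snapshot = copy.deepcopy(state)
        self._queue.put((path, snapshot))  # returns without waiting for the write

    def close(self):
        self._queue.put(None)
        self._proc.join()


if __name__ == "__main__":
    ckpt = BackgroundCheckpointer()
    path = os.path.join(tempfile.mkdtemp(), "step_1.pkl")
    state = {"step": 1, "weights": [0.1, 0.2]}
    ckpt.save(state, path)
    state["step"] = 2  # training continues; the staged snapshot is unaffected
    ckpt.close()
    with open(path, "rb") as f:
        print(pickle.load(f)["step"])
```

The snapshot is what makes the handoff safe: the queue serializes the object on a background thread, so without the copy a concurrent update could leak into the checkpoint.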