Implement async_checkpoint

Open fegin opened this issue 1 year ago • 0 comments

Stack from ghstack (oldest at bottom):

  • -> #302

Summary: This PR implements two async checkpointing approaches. The first uses DCP.async_save; the second uses pinned memory plus a separate process to avoid GIL contention.
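A minimal sketch of the second idea (stage a snapshot, then hand it to a separate writer process), under stated assumptions. It is not the code from this PR: `copy.deepcopy` stands in for the copy into pinned memory, `pickle` stands in for the real checkpoint format, and the names `AsyncCheckpointer` and `_writer_loop` are hypothetical. It uses only the Python standard library and the `fork` start method, so it assumes a POSIX system.

```python
import copy
import multiprocessing as mp
import os
import pickle


def _writer_loop(queue):
    """Runs in the child process: persist each staged snapshot it receives."""
    while True:
        item = queue.get()
        if item is None:  # sentinel: no more checkpoints, shut down
            break
        path, state = item
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        # Rename into place so readers never observe a half-written file.
        os.replace(tmp_path, path)


class AsyncCheckpointer:
    """Hypothetical helper: save() returns once the snapshot is staged."""

    def __init__(self):
        # Assumption: "fork" is available (POSIX only).
        ctx = mp.get_context("fork")
        self._queue = ctx.Queue()
        self._proc = ctx.Process(
            target=_writer_loop, args=(self._queue,), daemon=True
        )
        self._proc.start()

    def save(self, state, path):
        # Staging copy: a stand-in for copying tensors into pinned memory.
        # It isolates the snapshot from later in-place updates by training.
        staged = copy.deepcopy(state)
        self._queue.put((path, staged))

    def close(self):
        # Ask the writer to exit after draining the queue, then wait for it.
        self._queue.put(None)
        self._proc.join()
```

A training loop would call `save(state, path)` at each checkpoint step and `close()` once at shutdown; the file write and rename happen in the child process. Note that `multiprocessing.Queue` still pickles the snapshot in a feeder thread of the parent, so this sketch only moves the file I/O out of the training process. Sharing a staged buffer between the processes, as the summary's pinned-memory approach suggests, would presumably avoid that extra serialization step.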

fegin, May 03 '24 21:05