Implement fast checkpoint path

Open fegin opened this issue 1 year ago • 0 comments

This PR uses shared memory to perform asynchronous checkpointing in a separate process. It also implements asynchronous staging, which overlaps staging of the checkpoint with the next training iteration.
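To illustrate the idea, here is a minimal sketch of the pattern, not the code in this PR. It assumes a POSIX system (it uses the `fork` start method) and stands in a `bytearray` for the model state. The names `AsyncCheckpointer`, `_checkpoint_worker`, `wait_for_staging`, `wait_for_write`, and `demo` are hypothetical and exist only for this example. A background thread stages the state into a shared-memory buffer, and a separate process writes that buffer to disk.

```python
import multiprocessing as mp
import os
import tempfile
import threading
from multiprocessing import shared_memory


def _checkpoint_worker(shm_name, requests, done):
    """Checkpoint process: write staged bytes from shared memory to disk."""
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        while True:
            job = requests.get()
            if job is None:  # shutdown signal
                break
            nbytes, path = job
            with open(path, "wb") as f:
                f.write(shm.buf[:nbytes])
            done.put(path)
    finally:
        shm.close()


class AsyncCheckpointer:
    """Hypothetical helper: stage in a thread, persist in another process."""

    def __init__(self, capacity):
        ctx = mp.get_context("fork")  # assumption: POSIX platform
        self._capacity = capacity
        self._shm = shared_memory.SharedMemory(create=True, size=capacity)
        self._requests = ctx.Queue()
        self._done = ctx.Queue()
        self._worker = ctx.Process(
            target=_checkpoint_worker,
            args=(self._shm.name, self._requests, self._done),
            daemon=True,
        )
        self._worker.start()
        self._stager = None
        self._pending = False

    def save(self, state, path):
        if len(state) > self._capacity:
            raise ValueError("state does not fit in the staging buffer")
        # The buffer is reused, so the previous write must finish first.
        self.wait_for_write()

        def stage():
            # Staging: copy the state into shared memory, then hand off.
            self._shm.buf[: len(state)] = state
            self._requests.put((len(state), path))

        self._stager = threading.Thread(target=stage)
        self._stager.start()
        self._pending = True

    def wait_for_staging(self):
        if self._stager is not None:
            self._stager.join()
            self._stager = None

    def wait_for_write(self):
        self.wait_for_staging()
        if self._pending:
            self._done.get()
            self._pending = False

    def close(self):
        self.wait_for_write()
        self._requests.put(None)
        self._worker.join()
        self._shm.close()
        self._shm.unlink()


def demo():
    state = bytearray(b"step-0 weights")
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = AsyncCheckpointer(capacity=1024)
        path = os.path.join(tmp, "ckpt.bin")
        ckpt.save(state, path)
        # Forward/backward of the next iteration could run here.
        ckpt.wait_for_staging()  # required before the state is mutated
        state[:6] = b"step-1"    # stands in for the optimizer step
        ckpt.wait_for_write()
        with open(path, "rb") as f:
            saved = f.read()
        ckpt.close()
    return saved


if __name__ == "__main__":
    print(demo())
```

In this sketch the training loop only blocks at `wait_for_staging()`, just before the state is modified, and the disk write proceeds in the other process. A real implementation would also need to handle failures in the worker, which this example omits.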

fegin, Mar 12 '24 17:03