torchtitan

Fix the incorrect step log for profiler after resuming from a checkpoint

Open fegin opened this issue 1 year ago • 0 comments

Summary: The profiler currently maintains its own local step counter, which is not synchronized with the train step restored from a checkpoint. As a result, the steps it logs after resuming do not match the actual train step. This PR fixes the issue.
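To illustrate the mismatch, here is a minimal sketch in plain Python. It does not use torchtitan's or PyTorch's actual profiler code: `LocalCounterProfiler`, `SyncedProfiler`, `resume_and_train`, and the `start_step` parameter are all hypothetical names made up for this example, and the real fix may look different.

```python
class LocalCounterProfiler:
    """Hypothetical profiler wrapper that always counts from zero."""

    def __init__(self):
        self.step_num = 0  # local counter, unaware of any checkpoint

    def step(self):
        self.step_num += 1


class SyncedProfiler:
    """Hypothetical variant whose counter is seeded with the restored train step."""

    def __init__(self, start_step=0):
        self.step_num = start_step  # assumed to come from the loaded checkpoint

    def step(self):
        self.step_num += 1


def resume_and_train(profiler, checkpoint_step, num_steps):
    """Simulate resuming at checkpoint_step and running num_steps more steps.

    Returns a list of (train_step, profiler_step) pairs, one per step.
    """
    train_step = checkpoint_step
    pairs = []
    for _ in range(num_steps):
        train_step += 1
        profiler.step()
        pairs.append((train_step, profiler.step_num))
    return pairs


if __name__ == "__main__":
    # Local counter: the profiler step diverges from the train step after resume.
    print(resume_and_train(LocalCounterProfiler(), checkpoint_step=100, num_steps=3))
    # Seeded counter: the two stay aligned.
    print(resume_and_train(SyncedProfiler(start_step=100), checkpoint_step=100, num_steps=3))
```

In this sketch, seeding the counter from the checkpointed step is one possible way to keep the two aligned. Another would be to pass the train step to the profiler on every call rather than keeping a second counter at all.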

fegin, May 02 '24 06:05