Fix: timers('interval-time') bug and abnormal termination

Open bingnandu opened this issue 1 year ago • 1 comments

The save_checkpoint_and_time() function now manages the 'interval-time' timer internally. The external calls that start and stop this timer should therefore be removed to avoid conflicts and errors caused by double timing.
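A minimal sketch of the conflict described above, using a toy timer rather than Megatron-LM's real timer class. `ToyTimer`, `save_checkpoint_and_time_sketch`, and both caller functions are hypothetical names invented for illustration. The pause/save/resume order inside the helper is an assumption about what "internally manages" means here.

```python
import time


class ToyTimer:
    """Simplified stand-in for a named timer (NOT Megatron-LM's actual class)."""

    def __init__(self, name):
        self.name = name
        self.started = False
        self.elapsed = 0.0
        self._t0 = 0.0

    def start(self):
        if self.started:
            raise RuntimeError(f"timer '{self.name}' has already been started")
        self._t0 = time.perf_counter()
        self.started = True

    def stop(self):
        if not self.started:
            raise RuntimeError(f"timer '{self.name}' is not started")
        self.elapsed += time.perf_counter() - self._t0
        self.started = False


def save_checkpoint_and_time_sketch(timer, save_fn):
    """Hypothetical helper that owns the timer: pause, save, resume (assumed order)."""
    timer.stop()   # keep checkpoint time out of the interval measurement
    save_fn()
    timer.start()  # resume interval timing


def caller_with_redundant_calls(timer, save_fn):
    """Caller that still wraps the helper in its own stop/start calls."""
    timer.stop()   # redundant: the helper stops the timer again and fails
    save_checkpoint_and_time_sketch(timer, save_fn)
    timer.start()


def caller_fixed(timer, save_fn):
    """Caller after the fix: the helper alone manages the timer."""
    save_checkpoint_and_time_sketch(timer, save_fn)
```

With a running timer, `caller_fixed` leaves it running after the save, while `caller_with_redundant_calls` raises because the helper tries to stop a timer that the caller already stopped.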

bingnandu avatar Aug 20 '24 15:08 bingnandu

During pre-training of a GPT model, if the --profile flag is enabled and exit-interval is greater than profile-step-start, the application terminates abnormally with exit status -6.
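The scenario above can be sketched with a toy training loop. This is an illustration only, not the Megatron-LM code path: `ToyProfiler`, `train_sketch`, and the `profile_step_end` parameter are hypothetical names. The sketch also assumes that the abnormal exit comes from a profiler that is still active when the run stops at the exit interval.

```python
class ToyProfiler:
    """Simplified stand-in for a profiler session (hypothetical, for illustration)."""

    def __init__(self):
        self.active = False
        self.events = []

    def start(self):
        self.active = True
        self.events.append("start")

    def stop(self):
        self.active = False
        self.events.append("stop")


def train_sketch(num_steps, profile_step_start, profile_step_end,
                 exit_interval, profiler):
    """Run a dummy loop and return the last iteration executed."""
    for iteration in range(1, num_steps + 1):
        if iteration == profile_step_start:
            profiler.start()

        # ... one training step would run here ...

        if iteration == profile_step_end and profiler.active:
            profiler.stop()

        if exit_interval and iteration % exit_interval == 0:
            # Assumed fix: close the profiler before leaving the loop early,
            # so the process never exits with an open profiling session.
            if profiler.active:
                profiler.stop()
            return iteration

    if profiler.active:
        profiler.stop()
    return num_steps
```

For example, with `profile_step_start=10`, `profile_step_end=20`, and `exit_interval=15`, the loop exits at iteration 15 while profiling is in progress, and the cleanup branch stops the profiler first.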

bingnandu avatar Aug 20 '24 16:08 bingnandu

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Dec 29 '24 18:12 github-actions[bot]

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Mar 12 '25 18:03 github-actions[bot]

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Aug 01 '25 02:08 github-actions[bot]

Thanks for the PR! The timer issue has already been resolved in the current codebase. The redundant timers were removed in commit https://github.com/NVIDIA/Megatron-LM/commit/cef51542c378dcbb81143dc311132b5095394b17.

The profiler cleanup logic you identified is still valid. Feel free to submit a focused PR for that specific issue.

sbhavani avatar Oct 05 '25 22:10 sbhavani