Fix: timers('interval-time') bug and abnormal termination
The save_checkpoint_and_time() function now manages the 'interval-time' timer internally. External calls that start and stop this timer should therefore be removed, because timing the same interval twice causes conflicts and potential errors.
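To illustrate the conflict, here is a minimal sketch rather than the actual Megatron-LM code. The `Timer` class and the body of `save_checkpoint_and_time` below are hypothetical stand-ins. In particular, the sketch assumes the function pauses the timer around the save and resumes it afterwards; the real call order is not shown in this thread.

```python
import time


class Timer:
    """Hypothetical stand-in for a named training timer (not the real class)."""

    def __init__(self, name):
        self.name = name
        self.running = False
        self.elapsed = 0.0
        self._t0 = None

    def start(self):
        if self.running:
            raise RuntimeError(f"timer {self.name!r} has already been started")
        self._t0 = time.perf_counter()
        self.running = True

    def stop(self):
        if not self.running:
            raise RuntimeError(f"timer {self.name!r} is not started")
        self.elapsed += time.perf_counter() - self._t0
        self.running = False


def save_checkpoint_and_time(timer, save_fn):
    """Assumed behaviour: pause the interval timer, save, then resume it."""
    timer.stop()
    try:
        save_fn()
    finally:
        timer.start()


def demo():
    saved = []
    timer = Timer("interval-time")
    timer.start()

    # Correct usage: the function alone manages the timer.
    save_checkpoint_and_time(timer, lambda: saved.append("ckpt"))
    still_running = timer.running

    # Redundant external stop, followed by the internal stop: conflict.
    timer.stop()
    try:
        save_checkpoint_and_time(timer, lambda: saved.append("ckpt"))
        conflict = False
    except RuntimeError:
        conflict = True
    return still_running, conflict, saved
```

In this sketch the second call fails before the checkpoint is written, which is why the caller should not touch the timer at all once the function owns it.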
During pre-training of a GPT model, if the --profile flag is enabled and --exit-interval is greater than --profile-step-start, the application terminates abnormally with exit status -6.
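One way to avoid exiting while profiling is still active is to stop the profiler on every exit path. The sketch below is a simplified illustration, not the training loop from the repository: `FakeProfiler`, `train`, and their parameters are hypothetical names, and the real profiler start/stop calls are replaced by a plain Python object.

```python
class FakeProfiler:
    """Hypothetical profiler handle that records start/stop calls."""

    def __init__(self):
        self.active = False
        self.events = []

    def start(self):
        self.active = True
        self.events.append("start")

    def stop(self):
        self.active = False
        self.events.append("stop")


def train(num_iters, profile_step_start, profile_step_end, exit_interval, profiler):
    """Run a dummy loop and return the last completed iteration."""
    try:
        for iteration in range(1, num_iters + 1):
            if iteration == profile_step_start:
                profiler.start()

            # A real training step would run here.

            if iteration == profile_step_end and profiler.active:
                profiler.stop()

            # Early exit may happen while the profiler is still running.
            if exit_interval and iteration % exit_interval == 0:
                return iteration
        return num_iters
    finally:
        # Cleanup runs on every exit path, including the early return.
        if profiler.active:
            profiler.stop()
```

For example, with `profile_step_start=10`, `profile_step_end=20`, and `exit_interval=15`, the loop exits at iteration 15 and the `finally` block stops the profiler before control leaves `train`.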
Marking as stale. No activity in 60 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
Thanks for the PR! The timer issue has already been resolved in the current codebase. The redundant timers were removed in commit https://github.com/NVIDIA/Megatron-LM/commit/cef51542c378dcbb81143dc311132b5095394b17.
The profiler cleanup logic you identified is still valid. Feel free to submit a focused PR for that specific issue.