
Inform submission about `accumulated_submission_time`

Open Niccolo-Ajroldi opened this issue 1 year ago • 0 comments

Description

Currently, `update_params` has no up-to-date information about the time elapsed since the start of training.

My motivation for adding this feature is to simplify the implementation of a time-based learning rate schedule.

Can't a submission just keep track of time, or estimate it? In theory yes: this is allowed by the rules and is feasible. However, such an implementation would require synchronization among devices inside `update_params` when training in distributed mode, which would penalize the submission.
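To illustrate the point above, here is a minimal sketch of a submission-side timer. `SubmissionTimer` is a hypothetical helper, not part of the benchmark API; the comments note where a distributed run would need an extra synchronization step (e.g. broadcasting the measured value from one rank) so that all replicas feed the same time into a schedule:

```python
import time


class SubmissionTimer:
    """Hypothetical submission-side wall-clock tracker."""

    def __init__(self):
        # Monotonic clock: immune to system clock adjustments.
        self.start = time.monotonic()

    def elapsed(self):
        # Single-process estimate. In distributed training each
        # process measures a slightly different value, so the result
        # would have to be synchronized across devices (e.g. broadcast
        # from rank 0) inside update_params -- the overhead the issue
        # wants to avoid.
        return time.monotonic() - self.start
```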

Why is a time-based scheduler useful? Currently, a submission can implement an LR scheduler using `step_hint` as a step budget. This is a reliable estimate of the number of steps (N)AdamW needs to reach `max_runtime`. However, a submission could be faster or slower than (N)AdamW, and the size of this gap varies by workload. This makes deriving a custom step budget from `step_hint` suboptimal.
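As a concrete sketch of what a time-based schedule could look like, here is a warmup-plus-cosine decay parameterized by the fraction of `max_runtime` consumed instead of the fraction of `step_hint` consumed. The function name and hyperparameters (`base_lr`, `warmup_frac`) are illustrative, not part of any API:

```python
import math


def lr_at_time(accumulated_submission_time, max_runtime,
               base_lr=1e-3, warmup_frac=0.05):
    """Warmup + cosine decay driven by elapsed time, not steps."""
    # Fraction of the runtime budget consumed so far, clipped to 1.
    frac = min(accumulated_submission_time / max_runtime, 1.0)
    if frac < warmup_frac:
        # Linear warmup over the first warmup_frac of the budget.
        return base_lr * frac / warmup_frac
    # Cosine decay over the remaining budget.
    progress = (frac - warmup_frac) / (1.0 - warmup_frac)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Because the schedule depends only on elapsed time, it anneals to zero exactly when the runtime budget is exhausted, regardless of how many steps the optimizer manages to take.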

Implementation

We could simply pass `train_state` to `update_params`, or even just `train_state['accumulated_submission_time']`. Note that `update_params` already receives `eval_results` as an input, which tells the submission the elapsed time at the moment of the last evaluation; however, that value is not up to date with `accumulated_submission_time`.
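A minimal sketch of the proposed change might look as follows. The real `update_params` signature in the benchmark has many more arguments (workload, parameters, batch, RNG, etc.), omitted here for brevity; the hyperparameter names are hypothetical:

```python
def update_params(optimizer_state, hyperparameters, global_step,
                  train_state):
    # New argument: train_state, maintained by the harness, carries the
    # up-to-date elapsed time in 'accumulated_submission_time'.
    elapsed = train_state['accumulated_submission_time']

    # Example use: a linear decay over the runtime budget, with no
    # device synchronization needed inside the submission.
    frac = min(elapsed / hyperparameters['max_runtime'], 1.0)
    optimizer_state['lr'] = hyperparameters['base_lr'] * (1.0 - frac)
    return optimizer_state
```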

Niccolo-Ajroldi · Sep 12 '24 15:09