fix: dsat search space updates

Open garrett361 opened this issue 1 year ago • 1 comments

Description

The main goal of this PR is to update the search space used by our DeepSpeed Autotune (dsat) module. Previously, the stage-3 search space searched over irrelevant fields (allgather_bucket_size, reduce_scatter) and ignored important stage-3-specific fields (stage3_param_persistence_threshold, stage3_prefetch_bucket_size, ...). Notably, in some scenarios HF already sets default values for several of these fields, and those defaults seem to be quite strong.

Beyond that original goal, other features were also implemented along the way:

  • --divisible-by: forces all searched-over batch sizes to be divisible by this factor
  • --train-batch-size: only searches over gradient_accumulation_steps/train_micro_batch_size_per_gpu pairs that satisfy train_batch_size == slots_per_trial * gradient_accumulation_steps * train_micro_batch_size_per_gpu
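
As a rough illustration of the constraint above, here is a minimal sketch of how the valid pairs could be enumerated. It is not the dsat implementation: the function name `batch_size_pairs` is hypothetical, and it assumes the --divisible-by factor applies to the per-GPU microbatch size.

```python
from typing import List, Tuple


def batch_size_pairs(
    train_batch_size: int,
    slots_per_trial: int,
    divisible_by: int = 1,
) -> List[Tuple[int, int]]:
    """Return (gradient_accumulation_steps, micro_batch_size) pairs with
    train_batch_size == slots_per_trial * gradient_accumulation_steps * micro_batch_size.

    Hypothetical helper for illustration only. Assumption: divisible_by
    constrains the per-GPU microbatch size.
    """
    # No valid pair exists unless the global batch splits evenly across slots.
    if train_batch_size % slots_per_trial != 0:
        return []
    per_slot = train_batch_size // slots_per_trial

    pairs = []
    for micro_batch_size in range(1, per_slot + 1):
        # The microbatch size must divide the per-slot batch exactly
        # and respect the divisibility factor.
        if per_slot % micro_batch_size == 0 and micro_batch_size % divisible_by == 0:
            pairs.append((per_slot // micro_batch_size, micro_batch_size))
    return pairs


if __name__ == "__main__":
    for gas, mbs in batch_size_pairs(64, 8, divisible_by=2):
        print(f"gradient_accumulation_steps={gas}, micro_batch_size={mbs}")
```

With train_batch_size=64 and slots_per_trial=8, each slot must process 8 samples per step, so only the factorizations of 8 are candidates; divisible_by=2 then drops the pair with a microbatch size of 1.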

TODOs:

  • Update docs: some flags were moved and others are new. The start/end-profile-step defaults changed, and the default metric is now throughput.

Test Plan

Commentary (optional)

Checklist

  • [ ] Changes have been manually QA'd
  • [ ] User-facing API changes need the "User-facing API Change" label.
  • [ ] Release notes should be added as a separate file under docs/release-notes/. See Release Note for details.
  • [ ] Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

garrett361 avatar Jul 19 '23 16:07 garrett361

Deploy Preview for determined-ui canceled.

Latest commit: 7be633151c3fbadd8866df6ea3bdb3e485ad0772
Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/64dbe34a85a9f1000881f6fe

netlify[bot] avatar Jul 19 '23 16:07 netlify[bot]