determined
fix: dsat search space updates
Description
The main goal of this PR was to update the search space used for our DeepSpeed Autotune (`dsat`) module. Previously, the stage-3 search space searched over irrelevant fields (`allgather_bucket_size`, `reduce_scatter`) and ignored other important, stage-3-specific fields (`stage3_param_persistence_threshold`, `stage3_prefetch_bucket_size`, ...). In particular, in some scenarios HF already sets default values for several of these fields, and those defaults seem to be quite strong.
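For illustration only, a stage-3 search space over the fields named above could be sampled roughly as follows. This is a minimal sketch, not the `dsat` implementation: the helper name `sample_stage3_config`, the `STAGE3_CANDIDATES` table, and every candidate value are assumptions made up for this example.

```python
import random

# Candidate values are illustrative assumptions, not the ranges dsat searches.
STAGE3_CANDIDATES = {
    "stage3_param_persistence_threshold": [10**4, 10**5, 10**6],
    "stage3_prefetch_bucket_size": [10**7, 5 * 10**7, 10**8],
}


def sample_stage3_config(rng):
    """Return a `zero_optimization` block for ZeRO stage 3 with randomly chosen field values."""
    zero = {"stage": 3}
    for field, choices in STAGE3_CANDIDATES.items():
        zero[field] = rng.choice(choices)
    return {"zero_optimization": zero}


# Seeded generator so repeated runs sample the same configuration.
config = sample_stage3_config(random.Random(0))
```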
Beyond the original goal, other features were also implemented along the way:
- `--divisible-by`: forces all searched-over batch sizes to be divisible by this factor
- `--train-batch-size`: only searches over `gradient_accumulation_steps`/`train_microbatch_size_per_gpu` pairs which satisfy `train_batch_size == slots_per_trial * gradient_accumulation_steps * train_microbatch_size_per_gpu`
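The constraint imposed by `--train-batch-size` can be sketched as a small enumeration. This is a hedged illustration rather than the code in this PR: the function `batch_size_pairs` is hypothetical, and it assumes that `--divisible-by` applies to the per-GPU micro-batch size.

```python
def batch_size_pairs(train_batch_size, slots_per_trial, divisible_by=1):
    """Return (gradient_accumulation_steps, train_microbatch_size_per_gpu) pairs.

    Every returned pair satisfies
    train_batch_size == slots_per_trial * gradient_accumulation_steps * train_microbatch_size_per_gpu.
    """
    if train_batch_size % slots_per_trial != 0:
        return []  # No integer pair can satisfy the constraint.
    per_slot = train_batch_size // slots_per_trial
    pairs = []
    for mbs in range(1, per_slot + 1):
        # Assumption: the divisibility requirement is checked on the micro-batch size.
        if per_slot % mbs == 0 and mbs % divisible_by == 0:
            pairs.append((per_slot // mbs, mbs))
    return pairs


pairs = batch_size_pairs(train_batch_size=64, slots_per_trial=4, divisible_by=2)
# pairs == [(8, 2), (4, 4), (2, 8), (1, 16)]
```

With 4 slots and a global batch size of 64, each slot must process 16 samples per step, so only the divisors of 16 that are also multiples of 2 remain as micro-batch sizes.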
TODOs:
- Update docs. Some flags were moved, others are new. Changed the `start/end-profile-step` defaults and the default metric to `throughput`.
Test Plan
Commentary (optional)
Checklist
- [ ] Changes have been manually QA'd
- [ ] User-facing API changes need the "User-facing API Change" label.
- [ ] Release notes should be added as a separate file under `docs/release-notes/`. See Release Note for details.
- [ ] Licenses should be included for new code which was copied and/or modified from any external code.
Ticket
Deploy Preview for determined-ui canceled.
Name | Link |
---|---|
Latest commit | 7be633151c3fbadd8866df6ea3bdb3e485ad0772 |
Latest deploy log | https://app.netlify.com/sites/determined-ui/deploys/64dbe34a85a9f1000881f6fe |