
feat: stop using workload sequencer for PyTorchTrial [MLG-184]

Open azhou-determined opened this issue 2 years ago • 2 comments

Description

Summary of the calls made by the existing workload sequencer, for reference:

startup callbacks
set data loaders
load from checkpoint
hvd.broadcast parameters/optimizer state
try:
    for op in searcher_ops: (workloads)
        for batch in op.batches: (workload)
            train until min(checkpoint, validation, op complete, scheduling unit)
            train:
                train loop
                    train step
                    broadcast metrics
                    on_training_workload_end callback
                report training metrics
                report searcher progress
                check for preemption
            checkpoint/validate/finish/keep going
            checkpoint:
                update state last checkpoint
                save checkpoint
                    only on chief, save and broadcast uuid
                    call checkpoint_upload_end callbacks
                check for preemption
            validate:
                validation loop
                report searcher progress/complete
                report validation metrics
                maybe checkpoint (checkpoint policy)
                check for preemption
            upload_tb_files
        finish:
            checkpoint if the last checkpoint isn't up to date
            validate

except ShouldExit:
    checkpoint if not latest
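
For comparison, the summary above can be sketched as a single flat loop that has no sequencer object. This is only a minimal illustration under stated assumptions, not the implementation in this PR: `LoopConfig`, `LoopState`, `next_boundary`, and `run` are hypothetical names, all periods are counted in batches, the train step is replaced by a counter update, and the "skip the final validation if one already ran at this batch" guard is an assumption the summary does not spell out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class ShouldExit(Exception):
    """Stand-in for the preemption signal named in the summary."""


@dataclass
class LoopConfig:
    # Hypothetical knobs; every value is a number of batches.
    op_length: int
    checkpoint_period: int
    validation_period: int
    scheduling_unit: int


@dataclass
class LoopState:
    batches_trained: int = 0
    last_checkpoint: int = 0
    last_validation: int = 0
    events: List[str] = field(default_factory=list)


def next_boundary(cfg: LoopConfig, state: LoopState) -> int:
    """Train until min(checkpoint, validation, op complete, scheduling unit)."""
    done = state.batches_trained

    def next_multiple(period: int) -> int:
        return (done // period + 1) * period

    return min(
        cfg.op_length,
        next_multiple(cfg.checkpoint_period),
        next_multiple(cfg.validation_period),
        next_multiple(cfg.scheduling_unit),
    )


def checkpoint(state: LoopState) -> None:
    state.events.append(f"checkpoint@{state.batches_trained}")
    state.last_checkpoint = state.batches_trained


def validate(state: LoopState) -> None:
    state.events.append(f"validate@{state.batches_trained}")
    state.last_validation = state.batches_trained


def run(cfg: LoopConfig, preempt_at: Optional[int] = None) -> LoopState:
    state = LoopState()
    try:
        while state.batches_trained < cfg.op_length:
            stop = next_boundary(cfg, state)
            state.batches_trained = stop  # placeholder for the real train loop
            state.events.append(f"report_train@{stop}")
            # Check for preemption after reporting training metrics.
            if preempt_at is not None and stop >= preempt_at:
                raise ShouldExit()
            if stop % cfg.checkpoint_period == 0:
                checkpoint(state)
            if stop % cfg.validation_period == 0:
                validate(state)
        # finish: checkpoint if the last one is stale, then validate.
        if state.last_checkpoint != state.batches_trained:
            checkpoint(state)
        if state.last_validation != state.batches_trained:  # assumed guard
            validate(state)
    except ShouldExit:
        if state.last_checkpoint != state.batches_trained:
            checkpoint(state)
    return state
```

Calling `run(LoopConfig(op_length=10, checkpoint_period=4, validation_period=5, scheduling_unit=3))` records one event per boundary in `state.events`, which makes the ordering of train reports, checkpoints, and validations easy to inspect. Distributed concerns from the summary (chief-only saves, metric broadcasts, callbacks, TensorBoard uploads) are deliberately left out.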

Test Plan

Ensure existing functionality of PyTorchTrial training:

  • Training/validation steps
  • Save
  • Resume training
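
One way to make the save and resume items concrete is to check that a run interrupted by a checkpoint ends in the same state as an uninterrupted run. The sketch below shows only the shape of such a check; `ToyTrainer`, `train`, and `check_resume_matches_uninterrupted` are hypothetical, and a JSON file stands in for a real checkpoint. It does not use PyTorchTrial or any Determined API.

```python
import json
import os
import tempfile


class ToyTrainer:
    """Hypothetical stand-in for a trial with deterministic per-batch updates."""

    def __init__(self):
        self.weight = 0.0
        self.batches_trained = 0

    def train_batch(self):
        self.batches_trained += 1
        self.weight += 0.5 * self.batches_trained

    def save(self, path):
        state = {"weight": self.weight, "batches_trained": self.batches_trained}
        with open(path, "w") as f:
            json.dump(state, f)

    def load(self, path):
        with open(path) as f:
            state = json.load(f)
        self.weight = state["weight"]
        self.batches_trained = state["batches_trained"]


def train(trainer, until):
    """Train until the trainer has seen `until` batches in total."""
    while trainer.batches_trained < until:
        trainer.train_batch()


def check_resume_matches_uninterrupted(total=10, save_at=4):
    # Reference run with no interruption.
    straight = ToyTrainer()
    train(straight, total)

    # Interrupted run: train, save, restore into a fresh object, keep training.
    first = ToyTrainer()
    train(first, save_at)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ckpt.json")
        first.save(path)
        resumed = ToyTrainer()
        resumed.load(path)
    train(resumed, total)
    return straight, resumed
```

The same pattern applies to a real trial if the comparison covers everything a checkpoint is expected to restore, such as model weights, optimizer state, and the batch counter.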

Commentary (optional)

Checklist

  • [ ] Changes have been manually QA'd
  • [ ] User-facing API changes need the "User-facing API Change" label.
  • [ ] Release notes should be added as a separate file under docs/release-notes/. See Release Note for details.
  • [ ] Licenses should be included for new code which was copied and/or modified from any external code.
  • [ ] If modifying /webui/react/src/shared/ verify make -C webui/react test-shared passes.

azhou-determined, Sep 19 '22 19:09

Deploy Preview for storybook-det canceled.

Name Link
Latest commit df4399e3418c72857fe3bbff52e864e1cd140b4c
Latest deploy log https://app.netlify.com/sites/storybook-det/deploys/637447eea6b24d0008b47e37

netlify[bot], Sep 19 '22 19:09

Deploy Preview for determined-ui canceled.

Name Link
Latest commit df4399e3418c72857fe3bbff52e864e1cd140b4c
Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/637447ee53dca30008b13a39

netlify[bot], Sep 19 '22 19:09