How to continue training when a job fails
Hi, for example, I am training a job using this YAML. How do I continue training if the job fails? Thanks.
You can add a load_path argument to the trainer section of that YAML to resume the job from an earlier checkpoint.
Something like this:
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 850000ba
  eval_interval: 10000ba
  device_train_microbatch_size: 16
  run_name: ${name}
  seed: ${seed}
  load_path: # Path to checkpoint to resume training from
  save_folder: # Insert path to save folder or bucket
  save_interval: 10000ba
  save_overwrite: true
  autoresume: false
  fsdp_config:
    sharding_strategy: "SHARD_GRAD_OP"
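
Alternatively, if you don't want to point load_path at a specific checkpoint file each time the job restarts, Composer also has an autoresume option. The sketch below is just an illustration under a few assumptions: the save_folder value is a hypothetical placeholder (use your own folder or bucket), and as far as I know autoresume expects the same run_name and save_folder across restarts, with save_overwrite left false so earlier checkpoints are not clobbered. With that setup, the trainer looks for the latest checkpoint it previously saved in save_folder and resumes from it:

trainer:
  _target_: composer.Trainer
  run_name: ${name}                          # keep this stable across restarts
  save_folder: s3://my-bucket/checkpoints    # hypothetical location; use your own folder or bucket
  save_interval: 10000ba
  save_overwrite: false                      # earlier checkpoints must survive for autoresume to find them
  autoresume: true                           # resume from the latest checkpoint found in save_folder

Either way, when a checkpoint is loaded the trainer by default restores the model weights, optimizer state, and training timestamp, so the run continues from where it stopped rather than starting over.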