How to continue training when a job fails
Hi, for example, I am training a job using this YAML. How do I continue training if the job fails? Thanks.
You can add a load_path argument to the trainer section of that YAML to resume the job from an earlier checkpoint.
Something like this:
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 850000ba
  eval_interval: 10000ba
  device_train_microbatch_size: 16
  run_name: ${name}
  seed: ${seed}
  load_path: # Path to checkpoint to resume training from
  save_folder: # Insert path to save folder or bucket
  save_interval: 10000ba
  save_overwrite: true
  autoresume: false
  fsdp_config:
    sharding_strategy: "SHARD_GRAD_OP"
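
Alternatively, if you don't want to point load_path at a specific checkpoint file each time the job restarts, Composer also has an autoresume option. The sketch below is just an illustration under a few assumptions: the save_folder value is a hypothetical placeholder (use your own folder or bucket), and as far as I know autoresume expects the same run_name and save_folder across restarts, with save_overwrite left false so earlier checkpoints are not clobbered. With that setup, the trainer looks for the latest checkpoint it previously saved in save_folder and resumes from it:

trainer:
  _target_: composer.Trainer
  run_name: ${name}                          # keep this stable across restarts
  save_folder: s3://my-bucket/checkpoints    # hypothetical location; use your own folder or bucket
  save_interval: 10000ba
  save_overwrite: false                      # earlier checkpoints must survive for autoresume to find them
  autoresume: true                           # resume from the latest checkpoint found in save_folder

Either way, when a checkpoint is loaded the trainer by default restores the model weights, optimizer state, and training timestamp, so the run continues from where it stopped rather than starting over.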