
[Ray Train] - Add Options to Save Last checkpoint in Ray Train Checkpointing Config

kamal-rahimi opened this issue 2 years ago • 3 comments

Description

Checkpointing in Ray Train (CheckpointConfig) currently has the following options, sketched together below:

num_to_keep
checkpoint_score_attribute
checkpoint_score_order
checkpoint_frequency
checkpoint_at_end
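
For reference, a rough sketch of how these options fit together in one config; the metric name and values are illustrative, not taken from the issue:

from ray.train import CheckpointConfig

# Illustrative values; "loss" is a placeholder metric name.
checkpoint_config = CheckpointConfig(
    num_to_keep=3,                      # retain at most 3 checkpoints
    checkpoint_score_attribute="loss",  # rank checkpoints by this metric
    checkpoint_score_order="min",       # lower is better
    checkpoint_frequency=5,             # trainers with built-in loops only
    checkpoint_at_end=True,             # trainers with built-in loops only
)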

It would be highly useful to add an option to always keep the last checkpoint in addition to the num_to_keep best ones.

Use case

In many scenarios, it is desirable to keep the checkpoints with the best metric. However, when training is interrupted (for example, when the only worker runs on a spot instance and it gets terminated), it is necessary to restore from the latest checkpoint, not from the best one that was saved.
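
A minimal sketch of the restore path this implies, assuming Ray 2.x's ray.train.get_checkpoint() API; the toy model and the state.pt file layout are illustrative:

import os

import torch
import ray.train

def train_func():
    model = torch.nn.Linear(4, 1)  # toy model standing in for the real one
    start_epoch = 0

    # After an interruption, the restarted worker is handed the most recently
    # *retained* checkpoint, so whether the last one was kept matters here.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 10):
        ...  # regular training steps, reporting checkpoints as usual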

kamal-rahimi avatar Oct 19 '23 21:10 kamal-rahimi

Hi @kamal-rahimi,

Actually, the default behavior should be to keep the last checkpoint, for exactly the reasons you've mentioned. Are you observing something different?

Example:


import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig, RunConfig, Checkpoint, CheckpointConfig

def train_func():
    for i in range(10):
        # Report a checkpoint on every iteration; "i" is the score attribute.
        checkpoint = Checkpoint.from_directory(".")
        ray.train.report({"i": i}, checkpoint=checkpoint)

scaling_config = ScalingConfig(num_workers=2)
# Keep the 2 best checkpoints by lowest "i"; the latest checkpoint is retained as well.
checkpoint_config = CheckpointConfig(checkpoint_score_attribute="i", checkpoint_score_order="min", num_to_keep=2)
run_config = RunConfig(storage_path="/tmp/storage", name="experiment", checkpoint_config=checkpoint_config)
torch_trainer = TorchTrainer(train_func, scaling_config=scaling_config, run_config=run_config)

torch_trainer.fit()

$ tree /tmp/storage/experiment
/tmp/storage/experiment
├── TorchTrainer_50ef4_00000_0_2023-11-07_15-49-40
│   ├── checkpoint_000000
│   │   ├── events.out.tfevents.1699400984.g-c83faa40fb9f40001
│   │   ├── params.json
│   │   ├── params.pkl
│   │   └── result.json
│   ├── checkpoint_000001
│   │   ├── events.out.tfevents.1699400984.g-c83faa40fb9f40001
│   │   ├── params.json
│   │   ├── params.pkl
│   │   └── result.json
│   ├── checkpoint_000009
│   │   ├── events.out.tfevents.1699400984.g-c83faa40fb9f40001
│   │   ├── params.json
│   │   ├── params.pkl
│   │   ├── progress.csv
│   │   └── result.json
│   ├── events.out.tfevents.1699400984.g-c83faa40fb9f40001
│   ├── params.json
│   ├── params.pkl
│   ├── progress.csv
│   └── result.json
├── basic-variant-state-2023-11-07_15-49-40.json
├── experiment_state-2023-11-07_15-49-40.json
├── trainer.pkl
└── tuner.pkl

matthewdeng avatar Nov 07 '23 23:11 matthewdeng

Hi @matthewdeng, thank you for looking into this issue and for the information.

Yes, by default the latest checkpoint is preserved. However, what I mean is an option to always keep the last checkpoint in addition to the best checkpoints, which are selected based on the monitored metric (checkpoint_score_attribute). For example, Lightning's checkpointing provides a similar option.
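
For comparison, a rough sketch of the Lightning option being referred to, assuming a recent PyTorch Lightning version; the val_loss metric is illustrative:

from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the two best checkpoints by validation loss and, independently,
# always keep the most recent one.
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    mode="min",
    save_top_k=2,
    save_last=True,
)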

kamal-rahimi avatar Nov 08 '23 01:11 kamal-rahimi

I see that there is a checkpoint_at_end option now, but with the TorchTrainer (and using trainer = ray_lightning.prepare_trainer(trainer)) I get:

ValueError: You passed checkpoint_at_end=True to your CheckpointConfig, but this trainer does not support this argument. If you passed in a Trainer that takes in a custom training loop, you should include one last call to ray.train.report(metrics=..., checkpoint=...) at the end of your training loop to get this behavior.
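
Following the suggestion in that error message, a minimal sketch of the workaround for a custom training loop, reporting one final checkpoint before train_func returns; the toy model and file name are illustrative:

import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint

def train_func():
    model = torch.nn.Linear(4, 1)  # stand-in for the real model
    for epoch in range(10):
        ...  # custom training loop body

    # Emulate checkpoint_at_end: report one final checkpoint explicitly
    # before train_func returns.
    with tempfile.TemporaryDirectory() as tmp_dir:
        torch.save(model.state_dict(), os.path.join(tmp_dir, "model.pt"))
        ray.train.report(
            {"final": 1},
            checkpoint=Checkpoint.from_directory(tmp_dir),
        )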

sgerber-hf avatar Apr 29 '24 19:04 sgerber-hf