[Ray Train] - Add Options to Save Last checkpoint in Ray Train Checkpointing Config
Description
Checkpointing in Ray Train (CheckpointConfig) currently has the following options:
num_to_keep
checkpoint_score_attribute
checkpoint_score_order
checkpoint_frequency
checkpoint_at_end
It would be highly useful to add an option to always keep the last checkpoint, in addition to the num_to_keep best checkpoints.
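To make the request concrete, here is a minimal sketch. The first CheckpointConfig only uses arguments that exist today; the commented-out always_keep_latest flag is purely hypothetical and is not a real CheckpointConfig argument.

from ray.train import CheckpointConfig

# Existing behavior: keep the 2 best checkpoints ranked by "accuracy".
checkpoint_config = CheckpointConfig(
    num_to_keep=2,
    checkpoint_score_attribute="accuracy",
    checkpoint_score_order="max",
)

# Hypothetical proposal (NOT a real argument today): also retain the most
# recent checkpoint regardless of its score.
# checkpoint_config = CheckpointConfig(
#     num_to_keep=2,
#     checkpoint_score_attribute="accuracy",
#     checkpoint_score_order="max",
#     always_keep_latest=True,  # hypothetical option
# )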
Use case
In many scenarios, it is desirable to keep the checkpoints with the best metric values. However, when training is interrupted (for example, when the only worker runs on a spot instance that gets terminated), training must be restored from the latest checkpoint, not from the best checkpoint that was saved.
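For reference, a minimal sketch of the restore pattern this use case depends on, assuming a simple PyTorch model. On restart, the training function resumes from whatever checkpoint Ray Train hands back, which is why the latest checkpoint (not only the best-scoring one) needs to still be on disk.

import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint

def train_func():
    model = torch.nn.Linear(4, 1)  # stand-in model for this sketch
    start_epoch = 0

    # On restoration, Ray Train hands back the latest reported checkpoint.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "model.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 10):
        # ... training step ...
        with tempfile.TemporaryDirectory() as tmp_dir:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmp_dir, "model.pt"),
            )
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp_dir),
            )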
Hi @kamal-rahimi,
Actually the default behavior should be to keep the last checkpoint, for the exact reasons you've mentioned. Are you observing something different?
Example:
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig, RunConfig, Checkpoint, CheckpointConfig
def train_func():
    # Report a checkpoint on every iteration; "i" is used as the score attribute.
    for i in range(10):
        checkpoint = Checkpoint.from_directory(".")
        ray.train.report({"i": i}, checkpoint=checkpoint)

scaling_config = ScalingConfig(num_workers=2)
checkpoint_config = CheckpointConfig(checkpoint_score_attribute="i", checkpoint_score_order="min", num_to_keep=2)
run_config = RunConfig(storage_path="/tmp/storage", name="experiment", checkpoint_config=checkpoint_config)
torch_trainer = TorchTrainer(train_func, scaling_config=scaling_config, run_config=run_config)
torch_trainer.fit()
$ tree /tmp/storage/experiment
/tmp/storage/experiment
├── TorchTrainer_50ef4_00000_0_2023-11-07_15-49-40
│ ├── checkpoint_000000
│ │ ├── events.out.tfevents.1699400984.g-c83faa40fb9f40001
│ │ ├── params.json
│ │ ├── params.pkl
│ │ └── result.json
│ ├── checkpoint_000001
│ │ ├── events.out.tfevents.1699400984.g-c83faa40fb9f40001
│ │ ├── params.json
│ │ ├── params.pkl
│ │ └── result.json
│ ├── checkpoint_000009
│ │ ├── events.out.tfevents.1699400984.g-c83faa40fb9f40001
│ │ ├── params.json
│ │ ├── params.pkl
│ │ ├── progress.csv
│ │ └── result.json
│ ├── events.out.tfevents.1699400984.g-c83faa40fb9f40001
│ ├── params.json
│ ├── params.pkl
│ ├── progress.csv
│ └── result.json
├── basic-variant-state-2023-11-07_15-49-40.json
├── experiment_state-2023-11-07_15-49-40.json
├── trainer.pkl
└── tuner.pkl
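Another way to confirm which checkpoints were kept, instead of inspecting the storage directory with tree, is to look at the Result object. A minimal sketch, building on the example above (assigning the return value of fit() to result):

result = torch_trainer.fit()

# Latest reported checkpoint (kept even if it is not among the best-scoring).
print("latest:", result.checkpoint)

# Checkpoints retained under checkpoint_score_attribute / num_to_keep,
# together with the metrics they were reported with.
for checkpoint, metrics in result.best_checkpoints:
    print(metrics["i"], checkpoint.path)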
Hi @matthewdeng, thank you for looking into this issue and for the information.
Yes, by default the latest checkpoint is preserved. However, what I mean is an option to always keep the last checkpoint in addition to the best checkpoints, which are selected based on the monitored metric (checkpoint_score_attribute). For example, Lightning's checkpointing provides a similar option.
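This appears to refer to Lightning's ModelCheckpoint callback, where save_last retains the most recent checkpoint on top of the top-k best ones. A minimal sketch, assuming pytorch_lightning is installed:

from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",  # metric used to rank checkpoints
    mode="min",
    save_top_k=2,        # keep the two best checkpoints
    save_last=True,      # and always keep the latest one as well
)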
I see that there is a checkpoint_at_end option now, but with the TorchTrainer (and using trainer = ray_lightning.prepare_trainer(trainer)) I get:
ValueError: You passed checkpoint_at_end=True to your CheckpointConfig, but this trainer does not support this argument. If you passed in a Trainer that takes in a custom training loop, you should include one last call to ray.train.report(metrics=..., checkpoint=...) at the end of your training loop to get this behavior.
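As the error message suggests, with a custom training loop the equivalent of checkpoint_at_end is a final ray.train.report(...) call. A minimal sketch, reusing the Checkpoint.from_directory(".") pattern from the example above:

import ray.train
from ray.train import Checkpoint

def train_func():
    for i in range(10):
        # ... training step ...
        metrics = {"i": i}
        # Report a checkpoint only every few iterations during training.
        if i % 3 == 0:
            ray.train.report(metrics, checkpoint=Checkpoint.from_directory("."))
        else:
            ray.train.report(metrics)

    # Equivalent of checkpoint_at_end for a custom training loop:
    # one final report that includes a checkpoint.
    ray.train.report(metrics, checkpoint=Checkpoint.from_directory("."))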