
num_checkpoints_to_keep of CheckpointSaver doesn't work when the training is resumed

Open RolandGao opened this issue 2 years ago • 4 comments

**Environment**

PyTorch information

PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.10.2
Libc version: glibc-2.27

Python version: 3.8.3 (default, Jul 2 2020, 16:21:59) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-131-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Quadro RTX 6000
Nvidia driver version: 470.141.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.13.1
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==0.11.3
[pip3] torchvision==0.14.1
[conda] Could not collect

Composer information

Composer version: 0.13.2
Composer commit hash: None
Host processor model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Host processor core count: 20
Number of nodes: 1
Accelerator model name: Quadro RTX 6000
Accelerators per node: 1
CUDA Device Count: 1

**To reproduce**

Steps to reproduce the behavior:

  1. Set num_checkpoints_to_keep=1 on the CheckpointSaver callback instance (see the configuration sketch after these steps)
  2. Start training. During the second epoch, stop training, then resume from the saved checkpoint
  3. After the second epoch finishes, checkpoints from both the first and second epochs are in the checkpoint folder. Only the last epoch's checkpoint should be there, because num_checkpoints_to_keep=1.
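
A minimal configuration along these lines reproduces the setup. The `model` and `train_dataloader` names below are placeholders for whatever is being trained; only the checkpoint settings matter here.

```python
from composer import Trainer
from composer.callbacks import CheckpointSaver

# `model` and `train_dataloader` are placeholders for your ComposerModel and DataLoader.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='10ep',
    callbacks=[
        CheckpointSaver(
            folder='checkpoints',
            save_interval='1ep',
            num_checkpoints_to_keep=1,
        ),
    ],
    # For the resumed run, point load_path at the checkpoint written before the interruption.
    # load_path='checkpoints/<checkpoint-name>.pt',
)
trainer.fit()
```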

**Expected behavior**

Only the checkpoints of the last num_checkpoints_to_keep epochs should be in the checkpoint folder.

**Additional context**

CheckpointSaver._rotate_checkpoints() relies on self.saved_checkpoints, but self.saved_checkpoints is not written into the checkpoint file, so when training resumes the list is empty again. I think self.saved_checkpoints needs to be saved into the checkpoint so that it can be reloaded when training resumes.
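
If Composer persists callback state through the state_dict/load_state_dict hooks on callbacks (as I understand it does), the fix could look roughly like the sketch below. This is only an illustration of the idea, not the actual implementation; the attribute handling is assumed.

```python
# Hypothetical sketch of the proposed change inside CheckpointSaver
# (not the actual implementation; attribute handling is assumed).

class CheckpointSaver(Callback):
    # ... existing __init__, checkpoint-saving logic, _rotate_checkpoints(), etc. ...

    def state_dict(self) -> dict:
        # Persist the checkpoints written so far so that _rotate_checkpoints()
        # still sees them after the run is resumed.
        return {'saved_checkpoints': list(self.saved_checkpoints)}

    def load_state_dict(self, state: dict) -> None:
        # Restore the list when training resumes from a checkpoint.
        self.saved_checkpoints = list(state.get('saved_checkpoints', []))
```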

RolandGao avatar Mar 22 '23 20:03 RolandGao

@RolandGao I believe the intention was for num_checkpoints_to_keep to apply per run. The reasoning is that if a run is resumed, we don't want to delete previous checkpoints, since they are not "owned" by the active run, and deleting them could be confusing for users. In short, we err on the side of being less aggressive when deleting checkpoints, because accidentally deleting a checkpoint is a very bad outcome.

With that said, would you mind elaborating on your use case? Maybe there's a better design that can satisfy what you are doing as well.

CC: @eracah

mvpatel2000 avatar Mar 24 '23 16:03 mvpatel2000

I see your point. It would be bad to lose a checkpoint one wants.

I train my models on a Slurm cluster that often preempts and requeues my jobs, so I use MosaicML's autoresume feature. My disk space is limited, so I set num_checkpoints_to_keep=1. When a job finishes, I often find 10 checkpoints in the folder, presumably because the job got preempted and requeued 10 times. That uses 10 times as much storage, so I have to manually delete the older checkpoints whenever I'm running low on disk space.
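
Roughly, my setup is something like the sketch below (again, `model` and `train_dataloader` are placeholders; the relevant parts are autoresume and the checkpoint settings):

```python
from composer import Trainer

# `model` and `train_dataloader` stand in for the actual training objects.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='100ep',
    run_name='my-slurm-run',          # fixed run_name so autoresume can find the run
    save_folder='checkpoints',
    save_interval='1ep',
    save_num_checkpoints_to_keep=1,   # only rotates checkpoints saved within the current run
    autoresume=True,                  # when the job is requeued, resume from the latest checkpoint
)
trainer.fit()
```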

RolandGao avatar Mar 24 '23 18:03 RolandGao

Hm... I see. In this case, for now I would recommend adding your own callback that deletes the older checkpoints whenever a newer one gets written. In the meantime, we'll see if there's a better way to support this on our end, possibly with a flag.
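
For example, an untested sketch of such a callback might look like this. The name KeepLastCheckpoint and its arguments are made up, and it assumes checkpoints are written once per epoch into a flat local folder; register it after the CheckpointSaver so it runs once the newest checkpoint exists.

```python
import glob
import os

from composer.core import Callback, State
from composer.loggers import Logger


class KeepLastCheckpoint(Callback):
    """Untested sketch: after each epoch-level checkpoint event, delete every
    file in `folder` matching `pattern` except the most recently written one."""

    def __init__(self, folder: str, pattern: str = '*.pt'):
        self.folder = folder
        self.pattern = pattern

    def epoch_checkpoint(self, state: State, logger: Logger) -> None:
        # Sort local checkpoint files oldest-to-newest by modification time.
        checkpoints = sorted(
            glob.glob(os.path.join(self.folder, self.pattern)),
            key=os.path.getmtime,
        )
        # Keep the newest checkpoint, remove everything older.
        for path in checkpoints[:-1]:
            os.remove(path)
```

It would then be passed to the Trainer after the CheckpointSaver, e.g. callbacks=[checkpoint_saver, KeepLastCheckpoint('checkpoints')].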

mvpatel2000 avatar Mar 24 '23 19:03 mvpatel2000

Thanks! I will try writing a callback myself.

RolandGao avatar Mar 28 '23 03:03 RolandGao