RFC for ckpt apis

Open wconstab opened this issue 10 months ago • 3 comments

Stack from ghstack (oldest at bottom):

  • -> #226

Unsure how to proceed with the APIs, but a few nice-to-haves are:

1- specify load path separately from save path: you may want to load a "golden checkpoint" repeatedly and produce output checkpoints into isolated output dirs

2- enable/disable loading and saving of ckpts separately: you may want to use ckpt loading for initialization purposes, but not save ckpts. You may want to start from scratch, but save ckpts

3- specify a specific ckpt to load, not just the most recent one in a folder: you may want to test going back to a particular point and retraining, e.g. to match convergence over a recent period
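For illustration, the three nice-to-haves could map onto config fields roughly as sketched below. This is only a sketch, not torchtitan's actual API: the class name `CheckpointConfig` and the fields `enable_load`, `enable_save`, and `load_step` are hypothetical, and `load_folder` / `save_folder` are borrowed from the names used later in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    """Hypothetical checkpoint options; not torchtitan's real config."""

    # (1) Separate load and save locations.
    load_folder: Optional[str] = None
    save_folder: Optional[str] = None
    # (2) Loading and saving can be toggled independently.
    enable_load: bool = False
    enable_save: bool = False
    # (3) Specific step to load; None means "most recent in load_folder".
    load_step: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject combinations that cannot work.
        if self.enable_load and self.load_folder is None:
            raise ValueError("enable_load=True requires load_folder")
        if self.enable_save and self.save_folder is None:
            raise ValueError("enable_save=True requires save_folder")


# Load a golden checkpoint at step 500, write outputs to an isolated dir.
cfg = CheckpointConfig(
    load_folder="golden",
    save_folder="outputs/run1",
    enable_load=True,
    enable_save=True,
    load_step=500,
)
```

Starting from scratch while still saving would then be `CheckpointConfig(save_folder="outputs/run2", enable_save=True)`.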

wconstab avatar Apr 13 '24 00:04 wconstab

Agree that these are all nice to have, especially 1 and 2, which I've also thought about! One more concern, inspired by the mast experience: mast will repeatedly try launching the same job if previous runs failed. The current default of specifying a single folder for both load and save is nice in that, if an earlier run saved a checkpoint and then failed, the later run would just resume from that checkpoint. We might want to keep this capability after the API change.

tianyu-l avatar Apr 13 '24 00:04 tianyu-l

The RFC looks good to me. The only thing I can think of is that when both save_folder and load_folder exist and they don't share the same parent folder, the step would start from 0 again.

wz337 avatar Apr 15 '24 19:04 wz337

We discussed offline that when training 'for real' on a cluster, the auto-restart behavior would be messed up if the load path points to another folder, so we need to revise the proposal to ensure it only loads from the load folder the first time and then always loads from the save folder. We could rename the load folder to the init ckpt or something
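A minimal sketch of that revised rule, assuming checkpoints are stored as `step-<N>` subfolders (an assumption about the on-disk layout). The helper names `latest_step` and `resolve_load_path` and the parameter `init_ckpt_folder` are hypothetical, not part of torchtitan:

```python
import os
import re
from typing import Optional

# Assumed layout: one subfolder per checkpoint, named "step-<N>".
_STEP_DIR = re.compile(r"^step-(\d+)$")


def latest_step(folder: str) -> Optional[int]:
    """Return the highest step found in folder, or None if there is none."""
    if not os.path.isdir(folder):
        return None
    steps = [
        int(m.group(1))
        for name in os.listdir(folder)
        if (m := _STEP_DIR.match(name))
    ]
    return max(steps) if steps else None


def resolve_load_path(
    save_folder: str, init_ckpt_folder: Optional[str] = None
) -> Optional[str]:
    """Pick the checkpoint to load at job start.

    A checkpoint in save_folder always wins, so an auto-restarted job
    resumes its own progress. The init checkpoint is used only when
    save_folder has nothing yet, i.e. on the first launch.
    """
    step = latest_step(save_folder)
    if step is not None:
        return os.path.join(save_folder, f"step-{step}")
    if init_ckpt_folder is not None:
        init_step = latest_step(init_ckpt_folder)
        if init_step is not None:
            return os.path.join(init_ckpt_folder, f"step-{init_step}")
    return None  # nothing to load: start from scratch
```

With this ordering, a scheduler that relaunches a failed job needs no extra flags: the second launch sees the checkpoint written by the first and ignores the init folder.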

wconstab avatar Apr 17 '24 15:04 wconstab

> We discussed offline that when training 'for real' on a cluster, the auto-restart behavior would be messed up if the load path points to another folder, so we need to revise the proposal to ensure it only loads from the load folder the first time and then always loads from the save folder. We could rename the load folder to the init ckpt or something

@wconstab From my understanding, your script is already handling the requirements for seed checkpoints here. As long as the seed checkpoint's folder name matches the one in the checkpoint config used for training, we should be fine. So we can close the RFC.

wz337 avatar Jun 18 '24 21:06 wz337