torchtitan
RFC for ckpt apis
Stack from ghstack (oldest at bottom):
- -> #226
Unsure how to proceed with the APIs, but a few nice-to-haves are:
1- specify load path separately from save path: you may want to load a "golden checkpoint" repeatedly and produce output checkpoints into isolated output dirs
2- enable/disable loading and saving of ckpts separately: you may want to use ckpt loading for initialization purposes, but not save ckpts. You may want to start from scratch, but save ckpts
3- specify a specific ckpt to load, not just the most recent one in a folder: you may want to test going back to a particular point and retraining, e.g. to match convergence over a recent period
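To make the three nice-to-haves concrete, here is a rough sketch of what such a config could look like. Every name in it (`CheckpointConfig`, `enable_load`, `enable_save`, `load_folder`, `save_folder`, `load_step`) is hypothetical and chosen only for illustration; none of this is torchtitan's actual config surface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    """Hypothetical config shape; field names are illustrative only."""

    # (2) loading and saving are toggled independently
    enable_load: bool = False
    enable_save: bool = False
    # (1) load path is specified separately from save path
    load_folder: Optional[str] = None
    save_folder: Optional[str] = None
    # (3) a specific step to load; None means "most recent in load_folder"
    load_step: Optional[int] = None

    def validate(self) -> None:
        # Fail early on combinations that cannot work.
        if self.enable_load and self.load_folder is None:
            raise ValueError("enable_load requires load_folder")
        if self.enable_save and self.save_folder is None:
            raise ValueError("enable_save requires save_folder")
        if self.load_step is not None and self.load_step < 0:
            raise ValueError("load_step must be non-negative")
```

For example, `CheckpointConfig(enable_load=True, load_folder="golden", enable_save=True, save_folder="out/run1")` would cover the "golden checkpoint" case from item 1, while `enable_load=True, enable_save=False` covers init-only loading from item 2.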
Agree that these are all nice to have, especially 1 and 2, which I've also thought about! One more concern, inspired by the mast experience: mast will repeatedly try launching the same job if previous runs failed. The current default of specifying a single folder for both load and save is nice in that, if an earlier run saved a checkpoint and then failed, the later run will simply resume from that checkpoint. We might want to keep this capability after the API change.
The RFC looks good to me. The only thing that I can think of is that when both save_folder and load_folder exist and they don't have the same parent folder, the step would start from 0 again.
We discussed offline that when training 'for real' on a cluster, the auto-restart behavior would be messed up if the load path points to another folder, so we need to revise the proposal to ensure it only loads from the load folder the first time and then always loads from the save folder. We could rename the load folder to the init ckpt or something
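A minimal sketch of that revised rule, for discussion only: prefer the newest checkpoint in the save folder, and fall back to the init folder only when the save folder has nothing yet. The function names (`latest_step`, `resolve_load_path`), the `init_folder`/`init_step` parameters, and the `step-<N>` subfolder layout are all assumptions made for this example, not the actual implementation.

```python
import os
import re
from typing import Optional

# Assumed layout: each checkpoint is a subfolder named "step-<N>".
_STEP_DIR = re.compile(r"^step-(\d+)$")


def latest_step(folder: str) -> Optional[int]:
    """Return the highest step found in `folder`, or None if there is none."""
    if not os.path.isdir(folder):
        return None
    steps = []
    for name in os.listdir(folder):
        match = _STEP_DIR.match(name)
        if match and os.path.isdir(os.path.join(folder, name)):
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def resolve_load_path(
    save_folder: str,
    init_folder: Optional[str] = None,
    init_step: Optional[int] = None,
) -> Optional[str]:
    """Pick the checkpoint to load; None means start from scratch."""
    # Auto-restart case: an earlier attempt already saved here, so resume.
    step = latest_step(save_folder)
    if step is not None:
        return os.path.join(save_folder, f"step-{step}")
    # First launch: fall back to the init checkpoint, if one is configured.
    if init_folder is None:
        return None
    step = init_step if init_step is not None else latest_step(init_folder)
    if step is None:
        return None
    path = os.path.join(init_folder, f"step-{step}")
    return path if os.path.isdir(path) else None
```

With this ordering, a job that is relaunched after a failure resumes from its own save folder rather than restarting from the init checkpoint, which preserves the single-folder behavior described earlier in the thread.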
@wconstab From my understanding, your script already handles the requirements for seed checkpoints here. As long as the seed checkpoint's folder name is the same as the one in the checkpoint config used for training, we should be fine. So we can close the RFC.