
No param to control saving checkpoints every N steps?

Open apachemycat opened this issue 1 year ago • 4 comments

I can't find a param to control saving checkpoints every N steps, so too many checkpoint files pile up during training. Another suggestion: maybe the whole-model checkpoint (the merged LoRA model) should only be output at the last training step? That would be better.

apachemycat avatar May 16 '24 10:05 apachemycat

INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
1|2|Loss: 3.4330687522888184: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.53s/it]
INFO:torchtune.utils.logging:Model checkpoint of size 16.06 GB saved to /tmp/meta-Llama-3-8B-original/meta_model_0.pt
INFO:torchtune.utils.logging:Adapter checkpoint of size 0.04 GB saved to /tmp/meta-Llama-3-8B-original/adapter_0.pt
INFO:torchtune.utils.logging:Recipe checkpoint of size 0.00 GB saved to /tmp/meta-Llama-3-8B-original/recipe_state.pt
2|2|Loss: 4.366740703582764: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.98s/it]

The merged model checkpoint file is too large.

apachemycat avatar May 16 '24 10:05 apachemycat

Hey @apachemycat, an option to save only the trainable weights for intermediate checkpoints is a great idea! We will add support for this soon.
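
For illustration, a minimal sketch of what adapter-only intermediate checkpointing could look like. The helper name and the `"lora"` key filter are assumptions for this example, not the torchtune API:

```python
# Minimal sketch, not the torchtune implementation: persist only the LoRA
# adapter parameters at intermediate steps and leave the full merged model
# for the final checkpoint.
import torch

def save_adapter_only(model: torch.nn.Module, path: str) -> None:
    # Keep only parameters whose names mark them as LoRA adapter weights
    # (key naming is an assumption; real models may use different prefixes).
    adapter_state = {k: v for k, v in model.state_dict().items() if "lora" in k}
    torch.save(adapter_state, path)
```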

Regarding checkpointing every N steps, this will require more planning because we would need to save the state of the data loader (i.e. which samples have been iterated over) in order to cleanly resume training from one of these checkpoints.
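
As a rough illustration of why dataloader state matters, a mid-epoch checkpoint would need to record something like the following. All names here are hypothetical and not torchtune's recipe state format; it assumes a map-style dataset with a reproducible shuffle:

```python
# Minimal sketch: record how many batches of the current epoch were consumed
# so a resumed run can skip them and continue with unseen samples.
import torch

def save_step_checkpoint(path, model, optimizer, epoch, step_in_epoch, sampler_seed):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
            "step_in_epoch": step_in_epoch,  # batches already seen this epoch
            "sampler_seed": sampler_seed,    # to reproduce the shuffle order
        },
        path,
    )

def fast_forward(dataloader, step_in_epoch):
    # Rebuild the epoch's iterator and skip the batches that were already
    # trained on before the checkpoint was written.
    it = iter(dataloader)
    for _ in range(step_in_epoch):
        next(it)
    return it
```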

Another option we could consider is making the number of checkpoints that persist on disk configurable (independent of how frequently the latest checkpoint is saved). Is that something you'd find useful?
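
A keep-last-K policy could be as simple as the sketch below; the file pattern and function name are illustrative only, not an existing torchtune option:

```python
# Minimal sketch, hypothetical: retain only the K most recent checkpoint files
# in the output directory and delete the rest.
from pathlib import Path

def prune_checkpoints(output_dir: str, keep_last: int = 2) -> None:
    ckpts = sorted(
        Path(output_dir).glob("meta_model_*.pt"),
        key=lambda p: p.stat().st_mtime,  # oldest first
    )
    for old in ckpts[:-keep_last]:
        old.unlink()
```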

Thanks for raising this issue!

calvinpelletier avatar May 16 '24 17:05 calvinpelletier

1. Make the number of checkpoints that persist on disk configurable.
2. Automatically compare the current loss with the previous one; if the current loss is lower, remove the previous whole checkpoint and save the current one (see the sketch below). Maybe choice 2 is better?
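
A sketch of suggestion 2, purely illustrative and not part of torchtune:

```python
# Minimal sketch: overwrite the saved "best" checkpoint only when the current
# loss improves on the best loss seen so far.
import math
import torch

class BestCheckpointSaver:
    def __init__(self, path: str):
        self.path = path
        self.best_loss = math.inf

    def maybe_save(self, model: torch.nn.Module, current_loss: float) -> bool:
        if current_loss < self.best_loss:
            self.best_loss = current_loss
            torch.save(model.state_dict(), self.path)  # replaces the previous best
            return True
        return False
```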

apachemycat avatar May 18 '24 07:05 apachemycat

A potential solution is being discussed in #1107

RdoubleA avatar Aug 21 '24 18:08 RdoubleA