
No param to control saving checkpoints every N steps?

Open apachemycat opened this issue 1 year ago • 4 comments

I can't find a param to control saving checkpoints every N steps, so too many checkpoint files pile up during training. Another suggestion: maybe the whole-model checkpoint (the merged LoRA model) should only be output at the last training step? That would be better.

apachemycat avatar May 16 '24 10:05 apachemycat

INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
1|2|Loss: 3.4330687522888184: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.53s/it]
INFO:torchtune.utils.logging:Model checkpoint of size 16.06 GB saved to /tmp/meta-Llama-3-8B-original/meta_model_0.pt
INFO:torchtune.utils.logging:Adapter checkpoint of size 0.04 GB saved to /tmp/meta-Llama-3-8B-original/adapter_0.pt
INFO:torchtune.utils.logging:Recipe checkpoint of size 0.00 GB saved to /tmp/meta-Llama-3-8B-original/recipe_state.pt
2|2|Loss: 4.366740703582764: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.98s/it]

The merged model checkpoint file is too large.

apachemycat avatar May 16 '24 10:05 apachemycat

Hey @apachemycat, an option to save only the trainable weights for intermediate checkpoints is a great idea! We will add support for this soon.
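
For illustration, a minimal sketch of what adapter-only intermediate checkpointing could look like. The helper name and the `"lora"` key filter are assumptions for this example, not the torchtune API:

```python
# Minimal sketch, not the torchtune implementation: persist only the LoRA
# adapter parameters at intermediate steps and leave the full merged model
# for the final checkpoint.
import torch

def save_adapter_only(model: torch.nn.Module, path: str) -> None:
    # Keep only parameters whose names mark them as LoRA adapter weights
    # (key naming is an assumption; real models may use different prefixes).
    adapter_state = {k: v for k, v in model.state_dict().items() if "lora" in k}
    torch.save(adapter_state, path)
```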

Regarding checkpointing every N steps, this will require more planning because we would need to save the state of the data loader (i.e. which samples have been iterated over) in order to cleanly resume training from one of these checkpoints.
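
As a rough illustration of why dataloader state matters, a mid-epoch checkpoint would need to record something like the following. All names here are hypothetical and not torchtune's recipe state format; it assumes a map-style dataset with a reproducible shuffle:

```python
# Minimal sketch: record how many batches of the current epoch were consumed
# so a resumed run can skip them and continue with unseen samples.
import torch

def save_step_checkpoint(path, model, optimizer, epoch, step_in_epoch, sampler_seed):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
            "step_in_epoch": step_in_epoch,  # batches already seen this epoch
            "sampler_seed": sampler_seed,    # to reproduce the shuffle order
        },
        path,
    )

def fast_forward(dataloader, step_in_epoch):
    # Rebuild the epoch's iterator and skip the batches that were already
    # trained on before the checkpoint was written.
    it = iter(dataloader)
    for _ in range(step_in_epoch):
        next(it)
    return it
```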

Another option we could consider is making the number of checkpoints that persist on disk configurable (independent of how frequently the latest checkpoint is saved). Is that something you'd find useful?
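
A keep-last-K policy could be as simple as the sketch below; the file pattern and function name are illustrative only, not an existing torchtune option:

```python
# Minimal sketch, hypothetical: retain only the K most recent checkpoint files
# in the output directory and delete the rest.
from pathlib import Path

def prune_checkpoints(output_dir: str, keep_last: int = 2) -> None:
    ckpts = sorted(
        Path(output_dir).glob("meta_model_*.pt"),
        key=lambda p: p.stat().st_mtime,  # oldest first
    )
    for old in ckpts[:-keep_last]:
        old.unlink()
```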

Thanks for raising this issue!

calvinpelletier avatar May 16 '24 17:05 calvinpelletier

1. Make the number of checkpoints that persist on disk configurable.
2. Automatically compare the current loss with the previous one; if the current loss is lower, remove the previous whole checkpoint and save the current one (see the sketch below). Maybe choice 2 is better?
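
A sketch of suggestion 2, purely illustrative and not part of torchtune:

```python
# Minimal sketch: overwrite the saved "best" checkpoint only when the current
# loss improves on the best loss seen so far.
import math
import torch

class BestCheckpointSaver:
    def __init__(self, path: str):
        self.path = path
        self.best_loss = math.inf

    def maybe_save(self, model: torch.nn.Module, current_loss: float) -> bool:
        if current_loss < self.best_loss:
            self.best_loss = current_loss
            torch.save(model.state_dict(), self.path)  # replaces the previous best
            return True
        return False
```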

apachemycat avatar May 18 '24 07:05 apachemycat

A potential solution is being discussed in #1107

RdoubleA avatar Aug 21 '24 18:08 RdoubleA