h2o-llmstudio
[FEATURE] When saving multiple epochs, add an epoch number suffix when save best=False
🚀 Feature
Save a separate .pth file at each checkpoint instead of overwriting checkpoint.pth every time.
Motivation
It is often useful to see how the model performs at each epoch/savepoint. For example, when training an LLM, I want to measure the generative capabilities after each epoch and see whether they are improving.
Example: after epoch 1 it saves checkpoint_ep01.pth
after epoch 2 it saves checkpoint_ep02.pth
When loading the model back in according to the config, it would by default load sorted(glob("checkpoint_ep*"))[-1], i.e. the last epoch, so the behavior stays the same as it is now.
Alternatively, if save_best_only=true, keep the current behavior of saving as checkpoint.pth?
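The naming and loading scheme proposed above could be sketched roughly as follows. This is only an illustration of the idea, not LLM Studio's actual code: `checkpoint_name` and `latest_checkpoint` are hypothetical helpers, the `save_best` flag stands in for whatever the real config option is called, and the actual weight saving (e.g. via `torch.save`) is left out.

```python
from pathlib import Path
from typing import Optional


def checkpoint_name(epoch: int, save_best: bool) -> str:
    """Return the file name to save to (hypothetical helper)."""
    if save_best:
        # Current behavior: a single file that gets overwritten.
        return "checkpoint.pth"
    # Proposed behavior: one file per epoch, zero-padded so names sort correctly.
    return f"checkpoint_ep{epoch:02d}.pth"


def latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Pick the checkpoint to load by default (hypothetical helper)."""
    per_epoch = sorted(output_dir.glob("checkpoint_ep*.pth"))
    if per_epoch:
        # Last entry in sorted order is the most recent epoch.
        return per_epoch[-1]
    # Fall back to the single-file layout if no per-epoch files exist.
    single = output_dir / "checkpoint.pth"
    return single if single.exists() else None
```

Note that the two-digit padding only keeps lexicographic order correct up to epoch 99; a wider pad or a numeric sort key would be needed beyond that.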
We didn't do that by default because model weights take a ton of disk space.
We could theoretically make it a separate setting to additionally save all checkpoints, wdyt?
Most research papers only train for 1 epoch, sometimes 2. If the user knows what they're doing and wants to enable it, I think it's a nice option, especially since it's a simple implementation.