
[FEATURE] When saving multiple epochs, add an epoch number suffix when save best=False

Open Quetzalcohuatl opened this issue 1 year ago • 3 comments

🚀 Feature

Save a separate .pth file at each checkpoint instead of overwriting checkpoint.pth every time.

Motivation

It is often useful to see how the model performs at each epoch or save point. For example, when training an LLM, I want to measure its generative capabilities after each epoch and see whether they are improving.

Quetzalcohuatl avatar Jan 31 '24 00:01 Quetzalcohuatl

Example: after epoch 1 it saves checkpoint_ep01.pth

after epoch 2 it saves checkpoint_ep02.pth

When loading the model back in according to the config, it would by default load sorted(glob("checkpoint_ep*"))[-1], i.e. the last epoch, to keep the behavior the same as it currently is.

Alternatively, if save_best_only=true, keep the current behavior of saving as checkpoint.pth?
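A minimal sketch of the naming and loading behavior described above, using only the standard library. The function names `checkpoint_name` and `latest_checkpoint` are hypothetical and not part of h2o-llmstudio. One assumption differs from the one-liner above: the epoch number is parsed and compared numerically, because a plain `sorted(glob(...))` orders names as strings and would put `checkpoint_ep100.pth` before `checkpoint_ep99.pth`.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme taken from the example in this comment.
_EPOCH_PATTERN = re.compile(r"^checkpoint_ep(\d+)\.pth$")


def checkpoint_name(epoch: int, save_best_only: bool) -> str:
    """Return the file name to save under for the given epoch."""
    if save_best_only:
        # Keep the existing single-file behavior.
        return "checkpoint.pth"
    # Zero-pad to two digits, e.g. checkpoint_ep01.pth.
    return f"checkpoint_ep{epoch:02d}.pth"


def latest_checkpoint(output_dir) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, if any.

    Falls back to a plain checkpoint.pth, and returns None when the
    directory contains no checkpoint at all.
    """
    output_dir = Path(output_dir)
    numbered = []
    for path in output_dir.glob("checkpoint_ep*.pth"):
        match = _EPOCH_PATTERN.match(path.name)
        if match:
            # Compare epochs as integers, not as strings.
            numbered.append((int(match.group(1)), path))
    if numbered:
        return max(numbered, key=lambda item: item[0])[1]
    single = output_dir / "checkpoint.pth"
    return single if single.exists() else None
```

With files for epochs 1 and 2 in the output directory, `latest_checkpoint` returns the path to `checkpoint_ep02.pth`, which matches the "load the last epoch" default proposed here.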

Quetzalcohuatl avatar Jan 31 '24 01:01 Quetzalcohuatl

We didn't do that by default, as model weights take a ton of disk space.

We could theoretically make it a separate setting to additionally save all checkpoints, wdyt?
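One way such an opt-in setting could look, as a rough sketch only. The `CheckpointSettings` class, the `save_all_checkpoints` flag, and the `files_to_write` helper are hypothetical names chosen for illustration and do not exist in h2o-llmstudio. The assumption is that the usual checkpoint.pth is still written as before, and the per-epoch copy is written in addition only when the flag is enabled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckpointSettings:
    # Mirrors the save_best_only idea discussed in this thread.
    save_best_only: bool = False
    # Hypothetical opt-in flag: off by default to save disk space.
    save_all_checkpoints: bool = False


def files_to_write(epoch: int, is_best: bool, settings: CheckpointSettings) -> List[str]:
    """List the checkpoint file names to write at the end of an epoch."""
    files = []
    # Default behavior: overwrite checkpoint.pth, or only do so when
    # this epoch is the best one so far if save_best_only is set.
    if not settings.save_best_only or is_best:
        files.append("checkpoint.pth")
    # Opt-in behavior: additionally keep one file per epoch.
    if settings.save_all_checkpoints:
        files.append(f"checkpoint_ep{epoch:02d}.pth")
    return files
```

Keeping the flag off by default leaves current disk usage unchanged for anyone who does not ask for per-epoch files.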

psinger avatar Jan 31 '24 09:01 psinger

> We didn't do that by default, as model weights take a ton of disk space.
>
> We could theoretically make it a separate setting to additionally save all checkpoints, wdyt?

Most research papers only train for 1 epoch, sometimes 2. If the user knows what they're doing and wants to enable it, I think it's a nice option, especially since it's a simple implementation.

Quetzalcohuatl avatar Jan 31 '24 10:01 Quetzalcohuatl