
Validation and early stopping during training

Open kinggongzilla opened this issue 9 months ago • 5 comments

Is there a way to evaluate the model performance during training on a validation dataset and only save a new checkpoint if it achieves lower validation loss?

kinggongzilla avatar Apr 26 '24 20:04 kinggongzilla

Hi @kinggongzilla, thanks for filing this issue!

Currently we only support early stopping based on the number of steps taken in an epoch, i.e. you can set the max_steps_per_epoch flag in the configuration to stop training early after a fixed number of steps.
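For reference, this is how that flag looks in a torchtune YAML recipe config (the surrounding keys and values here are illustrative, not a complete config):

```yaml
# Illustrative fragment of a torchtune recipe config.
# max_steps_per_epoch is the currently supported "early stop" knob:
# training halts after this many steps in each epoch.
epochs: 1
max_steps_per_epoch: 500
gradient_accumulation_steps: 1
```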

However, this doesn't cover your use case of early stopping or saving a checkpoint based on validation results.

In-training evaluation, plus stopping criteria based on that evaluation, is a large design space we haven't looked deeply into. What do you folks think @ebsmothers @RdoubleA? I could see a future in which we allow users to specify a validation dataset or validation split, and incorporate validation metrics into our checkpointer's decision of whether to save a checkpoint. This is definitely something we could look at enabling in the future if there's sufficient interest.
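To make the idea concrete, the "save only on improvement / stop after no improvement" logic could be factored out of the recipe into a small tracker. This is a hypothetical sketch, not torchtune's API; the class name and parameters are my own invention:

```python
class BestLossTracker:
    """Track validation loss across evaluations.

    step() returns True when the new loss is a new best (i.e. the
    recipe should save a checkpoint); should_stop becomes True after
    `patience` consecutive evaluations without improvement.
    """

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.evals_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        # Improvement must beat the previous best by at least min_delta.
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.evals_without_improvement = 0
            return True
        self.evals_without_improvement += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.evals_without_improvement >= self.patience


# Usage inside a (hypothetical) training loop:
#
#   tracker = BestLossTracker(patience=2)
#   for epoch in range(cfg.epochs):
#       train_one_epoch(...)
#       val_loss = evaluate(model, val_loader)
#       if tracker.step(val_loss):
#           checkpointer.save_checkpoint(...)
#       if tracker.should_stop:
#           break
```

Keeping the decision logic separate from the recipe is one way to avoid the intrusiveness concern: the recipe only gains an `evaluate` call and two small branches.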

rohan-varma avatar Apr 26 '24 20:04 rohan-varma

Thanks for the quick reply! Being able to define a validation dataset and do early stopping based on the validation loss would definitely be super helpful.

kinggongzilla avatar Apr 26 '24 21:04 kinggongzilla

+1 this would be super useful.

optimass avatar May 06 '24 17:05 optimass

+1 Would be super useful!

Some-random avatar May 07 '24 05:05 Some-random

Thanks all for the comments. This feature (along with general validation loops) is fairly high on our wishlist right now. We still need to do a bit of design work to make sure it's not too intrusive into our recipes, but we definitely hear you on the need for this feature. We will keep you posted here!

ebsmothers avatar May 07 '24 16:05 ebsmothers