
Validation and early stopping during training

Open kinggongzilla opened this issue 9 months ago • 5 comments

Is there a way to evaluate the model performance during training on a validation dataset and only save a new checkpoint if it achieves lower validation loss?

kinggongzilla avatar Apr 26 '24 20:04 kinggongzilla

Hi @kinggongzilla, thanks for filing this issue!

Currently we only support early stopping based on the number of steps taken in an epoch, i.e. you can set the max_steps_per_epoch flag in the configuration to stop training early after a fixed number of steps.
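For reference, this is how that flag looks in a torchtune YAML recipe config (the surrounding keys and values here are illustrative, not a complete config):

```yaml
# Illustrative fragment of a torchtune recipe config.
# max_steps_per_epoch is the currently supported "early stop" knob:
# training halts after this many steps in each epoch.
epochs: 1
max_steps_per_epoch: 500
gradient_accumulation_steps: 1
```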

However, this doesn't cover your use case of early stopping or saving a checkpoint based on validation results.

In-training evaluation, plus stopping criteria based on that evaluation, is a large design space we haven't looked deeply into. What do you folks think @ebsmothers @RdoubleA? I could see a future in which we allow users to specify a validation dataset or validation split, and incorporate validation metrics into our checkpointer's decision of whether to save a checkpoint. This is definitely something we could look at enabling in the future if there's sufficient interest.
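To make the idea concrete, the "save only on improvement / stop after no improvement" logic could be factored out of the recipe into a small tracker. This is a hypothetical sketch, not torchtune's API; the class name and parameters are my own invention:

```python
class BestLossTracker:
    """Track validation loss across evaluations.

    step() returns True when the new loss is a new best (i.e. the
    recipe should save a checkpoint); should_stop becomes True after
    `patience` consecutive evaluations without improvement.
    """

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.evals_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        # Improvement must beat the previous best by at least min_delta.
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.evals_without_improvement = 0
            return True
        self.evals_without_improvement += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.evals_without_improvement >= self.patience


# Usage inside a (hypothetical) training loop:
#
#   tracker = BestLossTracker(patience=2)
#   for epoch in range(cfg.epochs):
#       train_one_epoch(...)
#       val_loss = evaluate(model, val_loader)
#       if tracker.step(val_loss):
#           checkpointer.save_checkpoint(...)
#       if tracker.should_stop:
#           break
```

Keeping the decision logic separate from the recipe is one way to avoid the intrusiveness concern: the recipe only gains an `evaluate` call and two small branches.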

rohan-varma avatar Apr 26 '24 20:04 rohan-varma

Thanks for the quick reply! Being able to define a validation dataset and do early stopping based on the validation loss would definitely be super helpful.

kinggongzilla avatar Apr 26 '24 21:04 kinggongzilla

+1 this would be super useful.

optimass avatar May 06 '24 17:05 optimass

+1 Would be super useful!

Some-random avatar May 07 '24 05:05 Some-random

Thanks all for the comments. This feature (along with general validation loops) is fairly high on our wishlist right now. We still need to do a bit of design work to make sure it's not too intrusive into our recipes, but we definitely hear you on the need for this feature. We will keep you posted here!

ebsmothers avatar May 07 '24 16:05 ebsmothers