FLAML Feature request: support validation datasets in AutoML.retrain_from_log()

Currently, AutoML.fit() supports custom validation datasets passed via X_val and y_val, but AutoML.retrain_from_log() does not. When I train my model with X_train and y_train, use validation datasets for hyperparameter validation, and save the logs, I expect to be able to use the logs later to warm start the model. However, that is not possible because AutoML.retrain_from_log() doesn't support validation datasets.

Here's the discussion @sonichi and I had about it earlier this week.

Discussed in https://github.com/microsoft/FLAML/discussions/727

Originally posted by harshvardhaniimi September 11, 2022

Hi, I'm trying to retrain a model with LightGBM. Earlier, I trained my model for a period of time and would like the algorithm to pick up where it stopped. retrain_from_log allows that, provided I have the log files saved, which I do. However, I'm having trouble using the validation dataset with X_val. I get the following error.

TypeError: fit() got an unexpected keyword argument 'X_val'

The code works well when I'm using automl.fit(); this error only occurs in automl.retrain_from_log(). Any help? Are X_val and y_val valid parameters only in fit() and not in retrain_from_log()?

Sep 18 '22 21:09 harshvardhaniimi
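
One plausible reading of the error above is that retrain_from_log() does not declare X_val or y_val itself and forwards any extra keyword arguments to the underlying estimator's fit(), which then rejects them. The sketch below reproduces that mechanism in plain Python. It does not call FLAML, and `SketchEstimator`, `retrain_sketch`, and `try_retrain` are hypothetical names introduced only for illustration; the forwarding behavior is an assumption, not something confirmed in this thread.

```python
class SketchEstimator:
    """Hypothetical stand-in for a learner whose fit() takes only training data."""

    def fit(self, X_train, y_train):
        # Record how many rows were seen; no real training happens here.
        self.n_rows = len(X_train)
        return self


def retrain_sketch(estimator, X_train, y_train, **fit_kwargs):
    """Hypothetical retrain helper.

    Assumption: keyword arguments the helper does not declare itself are
    forwarded unchanged to the estimator's fit().
    """
    return estimator.fit(X_train, y_train, **fit_kwargs)


def try_retrain(**extra):
    """Run the sketch and return "ok" or the TypeError message as a string."""
    try:
        retrain_sketch(SketchEstimator(), [[0.0], [1.0]], [0, 1], **extra)
        return "ok"
    except TypeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(try_retrain())                # no extra keyword arguments
    print(try_retrain(X_val=[[2.0]]))   # X_val reaches fit() and is rejected
```

If this is indeed what happens, the feature request amounts to having retrain_from_log() accept X_val and y_val explicitly instead of passing them through.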

@harshvardhaniimi validation data are not used for training during AutoML.fit().

Sep 19 '22 02:09 sonichi

Can you clarify the role of the validation dataset? I thought it was used to compute validation metrics during the model training process, and that once the process finds the best parameters, FLAML retrains the model on the full dataset (training plus validation data), given retrain_full = True.

Sep 19 '22 12:09 harshvardhaniimi

> Can you clarify the role of the validation dataset? I thought it was used to compute validation metrics during the model training process, and that once the process finds the best parameters, FLAML retrains the model on the full dataset (training plus validation data), given retrain_full = True.

You are right that they are used to compute validation metrics. However, user-provided validation data are not used for retraining. If users have not provided validation data, the validation data split from the training data will be used for retraining.

Sep 19 '22 16:09 sonichi
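
The rule described in the reply above can be sketched in plain Python to make the two cases concrete. This is only an illustration of the stated behavior, not FLAML's implementation: `final_retraining_rows` and `holdout_fraction` are hypothetical names, and the tail-of-the-list split is an assumption chosen for simplicity.

```python
def final_retraining_rows(train_rows, user_val_rows=None, holdout_fraction=0.1):
    """Return (search_rows, validation_rows, retrain_rows) for the two cases.

    Hypothetical sketch of the behavior described in the thread:
    - user-provided validation rows are used for metrics only and are
      NOT added to the rows used for the final retraining;
    - without user-provided validation rows, a holdout is split from the
      training rows and is included again when retraining on full data.
    """
    train_rows = list(train_rows)
    if user_val_rows is not None:
        search_rows = train_rows
        validation_rows = list(user_val_rows)
        # Final retraining sees the training rows only.
        retrain_rows = list(train_rows)
    else:
        # Assumed split: hold out the last rows of the training data.
        n_val = max(1, int(len(train_rows) * holdout_fraction))
        search_rows = train_rows[:-n_val]
        validation_rows = train_rows[-n_val:]
        # Final retraining sees the search rows plus the internal holdout.
        retrain_rows = search_rows + validation_rows
    return search_rows, validation_rows, retrain_rows


if __name__ == "__main__":
    rows = list(range(10))
    print(final_retraining_rows(rows, user_val_rows=[100, 101]))
    print(final_retraining_rows(rows))
```

In both cases the final model sees exactly the rows passed as training data, so anyone who wants user-provided validation rows included in the final model would have to add them to the training data themselves.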

It'd be worth noting in the docs that retrain_full = True does not include user-provided validation data.

Sep 26 '23 17:09 RossDeVito
