
Overfitting when using AutoML

Open leelew opened this issue 2 years ago • 3 comments

Hi,

We used FLAML to perform a regression task and found that the AutoML model was prone to overfitting. However, on the same task, other ML models (e.g., LightGBM, RF) could avoid overfitting by grid-searching for the best parameters. We tried adding 'cv=5' to the AutoML model, but it did not work in our case.

So could you give me some suggestions on how to avoid overfitting when using FLAML AutoML models?

BTW: We also used flaml.default.LGBMRegressor() to auto-search the hyper-parameters of a LightGBM model, but that model still overfits, whereas a plain LightGBM model avoids overfitting with grid search. So I think maybe I am misusing FLAML.

Lu Li


The code of the FLAML AutoML model:

```python
from flaml import AutoML

am = AutoML()
am.fit(x_train, y_train, task="regression")
```

The performance on training data: [attached plot]

The performance on test data: [attached plot]

leelew avatar Jul 22 '23 13:07 leelew

By default, "r2" is used as the optimization metric for regression tasks. Looking at your plots, the model doesn't overfit on the r2 or KGE metric; it overfits on RMSE. If you'd like to optimize for RMSE instead, please set metric="rmse".

sonichi avatar Jul 22 '23 15:07 sonichi

Hi Chi,

Thanks for your reply.

I think our model overfits not only on RMSE but also on R2 and KGE (i.e., the performance on training data is much better than on test data). We will try setting metric="rmse" and split_ratio=0.2.

The code is shown as:

```python
automl.fit(
    x_train,
    y_train,
    task="regression",
    metric="rmse",
    split_ratio=0.2,
    ensemble={"final_estimator": MLPRegressor(), "passthrough": True},
    time_budget=3600,
)
```

We will contact you again if this does not work. Thanks again for your help!

Best, Lu Li

leelew avatar Jul 23 '23 05:07 leelew

Hi Chi,

We set metric="rmse" and used the holdout strategy (split_ratio=0.2), but we still see overfitting. Although AutoML performs better than the other ML models on test data, its training performance is still much better than its test performance.

Is there any further suggestion to avoid overfitting when using AutoML?

Best, Lu


The code is:

```python
automl.fit(
    x_train,
    y_train,
    task="regression",
    metric="rmse",
    split_ratio=0.2,
    ensemble={"final_estimator": LGBMRegressor(), "passthrough": True},
    time_budget=3600,
)
```

The train performance is: [attached plot]

The test performance is: [attached plot]
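One further option worth noting here: FLAML also accepts a callable as `metric`, and its documentation sketches a custom metric that penalizes the train/validation gap directly, which steers the search away from configurations that overfit. Below is a hedged adaptation for RMSE; the callable signature is an assumption modeled on that documented pattern and should be checked against your FLAML version. It is demonstrated standalone with a plain sklearn estimator; with FLAML it would be passed as `automl.fit(..., metric=anti_overfit_rmse)`.

```python
# Sketch: a custom metric that penalizes the train/validation gap, modeled
# on the custom-metric pattern in FLAML's documentation (assumption: the
# callable receives the validation split, the fitted estimator, and the
# training split, and returns (loss_to_minimize, info_dict)).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def anti_overfit_rmse(
    X_val, y_val, estimator, labels,
    X_train, y_train, weight_val=None, weight_train=None,
    *args, **kwargs,
):
    val_rmse = float(np.sqrt(mean_squared_error(y_val, estimator.predict(X_val))))
    train_rmse = float(np.sqrt(mean_squared_error(y_train, estimator.predict(X_train))))
    alpha = 0.5  # how strongly to punish configs whose train loss runs ahead
    loss = val_rmse * (1 + alpha) - alpha * train_rmse
    return loss, {"val_rmse": val_rmse, "train_rmse": train_rmse}


# Standalone demonstration with a dummy fitted model:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
model = LinearRegression().fit(X[:80], y[:80])
loss, info = anti_overfit_rmse(X[80:], y[80:], model, None, X[:80], y[:80])
print(loss, info)
```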

leelew avatar Jul 24 '23 01:07 leelew