FLAML
n_estimators value on automl.model differs from value in logs (for CatBoost models)
Hi all,
The n_estimators value on the best model (automl.model) returned by FLAML does not seem to be set correctly for CatBoostClassifiers.
Example code here:
from flaml import AutoML
from sklearn import datasets
dic_data = datasets.load_iris(as_frame=True)  # sklearn Bunch; "frame" holds pandas objects
iris_data = dic_data["frame"]  # pandas DataFrame with data + target
automl = AutoML()
automl_settings = {
    "max_iter": 2,
    "metric": "accuracy",
    "task": "classification",
    "log_file_name": "catboost_error.log",
    "log_type": "all",
    "estimator_list": ["catboost"],
    "eval_method": "cv",
}
x_train = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]].to_numpy()
y_train = iris_data['target']
automl.fit(x_train, y_train, **automl_settings)
print(automl.model.get_params())
The print statement logs the following for me: {'early_stopping_rounds': 10, 'learning_rate': 0.09999999999999996, 'n_estimators': 33, 'thread_count': -1, 'verbose': False, 'random_seed': 10242048, 'task': <flaml.automl.task.generic_task.GenericTask object at 0x7f895f2b3830>, '_estimator_type': 'classifier'}
However, if I look into the actual catboost_error.log file, I can see that neither of the two estimators attempted had n_estimators = 33: they actually had n_estimators = 35 and n_estimators = 57. Replicating the FLAML folds myself showed that the n_estimators value should be 35, meaning that the logs are correct and automl.model is incorrect.
Furthermore, if I run print(automl.model.model.get_all_params()) I get a dictionary which includes iterations = 35. The CatBoost documentation shows that iterations is an alias of n_estimators, and while I haven't managed to pin down the exact cause of this issue, I believe it is tied in somewhere here.
In terms of package versions, I'm using FLAML 2.1.2, catboost 1.2.5, scikit-learn 1.5.0, and Python 3.12.0.
Hi, I will look into this when I can, but check the #1275 discussion as well; it seems they have come across the same issue. I will try to figure out what the cause is :) If anyone else can contribute or help out, please do, thanks!
I am getting the same problem with lgbm:
Best hyperparmeter config: {'n_estimators': 1314, 'num_leaves': 6376, 'min_child_samples': 38, 'learning_rate': 0.0988351059982288, 'log_max_bin': 9, 'colsample_bytree': 0.6663805206578503, 'reg_alpha': 0.001100862503118278, 'reg_lambda': 136.83211237673618}
Best r2 on validation data: 0.2975
Training duration of best run: 6365 s
LGBMRegressor(colsample_bytree=0.6663805206578503,
              learning_rate=0.0988351059982288, max_bin=511,
              min_child_samples=38, n_estimators=1, n_jobs=-1, num_leaves=6376,
              reg_alpha=0.001100862503118278, reg_lambda=136.83211237673618,
              verbose=-1)
Hi @dannycg1996, @jmrichardson, for catboost we always set n_estimators to 8192 and rely on early stopping in the fit function, so the reported value need not match the number of trees actually fitted. Early stopping can be triggered in lgbm as well.
https://github.com/microsoft/FLAML/blob/5c0f18b7bc705befb7e5bd400d204e6be04640d9/flaml/automl/model.py#L1984-L1987
To get a more deterministic result, we'll need to update the model settings. This is related to #1361.
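The mechanism described above can be sketched outside of FLAML. The following is an analogy using scikit-learn's GradientBoostingClassifier (not FLAML's actual code path): with early stopping enabled, the constructor's n_estimators is only an upper bound, so get_params() (which echoes the constructor argument) disagrees with the number of trees the fitted model actually contains.

```python
# Analogy using scikit-learn, not FLAML: with early stopping enabled,
# the constructor's n_estimators is only an upper bound on the trees fitted.
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
clf = GradientBoostingClassifier(
    n_estimators=500,        # upper bound, analogous to the 8192 FLAML uses for catboost
    n_iter_no_change=5,      # early-stopping patience on a held-out split
    validation_fraction=0.2,
    random_state=0,
)
clf.fit(X, y)

print(clf.get_params()["n_estimators"])  # 500: the constructor argument
print(clf.n_estimators_)                 # boosting stages actually fitted
```

If the same effect is behind the reports above, the n_estimators reported by automl.model.get_params() is the search-time setting, while the fitted tree count lives on the underlying model (e.g. iterations in get_all_params() for catboost).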