FLAML
how to pass categorical feature names or indices to a learner
Dear all, first of all thanks so much again for your awesome project and support.
I thought I could improve the best learner's score by passing categorical feature information to the learners capable of handling categorical features. I rearranged your extra-argument example:
```python
from flaml.data import load_openml_dataset
from flaml import AutoML
import numpy as np

X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./")

cat_feat_index_list = np.where(X_train.dtypes != float)[0]
cat_feat_index_list = cat_feat_index_list.astype(np.int32)
cat_feat_names_list = X_train.iloc[:, cat_feat_index_list].columns.to_list()

automl = AutoML()
automl_settings = {
    "task": "classification",
    "time_budget": 60,
    "estimator_list": "auto",
    "fit_kwargs_by_estimator": {
        "catboost": {
            "cat_features": cat_feat_index_list,
        }
    },
}
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
```
but I get the following error:

```
TypeError: catboost.core.CatBoostClassifier.fit() got multiple values for keyword argument 'cat_features'.
```
I tried many workarounds, but nothing worked.
Could you kindly help me?
Thanks in advance.
luigi
Could you also kindly share some useful information about categorical features in FLAML? How to manage them, how to pass them, and so on. Thanks again.
luigi
Have you tried using a pandas DataFrame? Categorical features are supposed to be recognized automatically by FLAML when you use a DataFrame, so you don't need to pass that info to FLAML. Supporting numpy arrays with categorical features would require additional work.
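To illustrate, here is a minimal sketch (with a hypothetical toy DataFrame; `X_train` would come from your own data) of marking string columns as the pandas `category` dtype before passing the DataFrame to `AutoML.fit()`:

```python
import pandas as pd

# Hypothetical toy frame standing in for your real X_train.
X_train = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.5, 3.0, 0.5],
})

# Mark string-like columns as pandas "category" dtype so FLAML can
# recognize them as categorical when the DataFrame is passed to fit().
for col in X_train.select_dtypes(include="object").columns:
    X_train[col] = X_train[col].astype("category")

print(X_train.dtypes["color"])  # category
```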
Dear, thanks so much: so no problem! I asked because I do use a DataFrame with many categorical features, but when FLAML converged and found XGBoost as the best model, I noticed that the `enable_categorical` flag was disabled. So I supposed that FLAML had not recognized the categoricals?

- Is there a way to verify that the categorical features are recognized as real categories and used as categorical during training?
- What about the XGBoost `enable_categorical` flag? How can I set it, or verify that FLAML handles it correctly?
Best regards.....
luigi
Hi @luigif2000 !
The `enable_categorical` flag is actually an experimental feature of XGBoost, and we don't support it yet. However, FLAML (including its XGBoost wrapper) handles categorical features just fine; it auto-preprocesses them, as you can see here.
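A rough sketch of what that auto-preprocessing amounts to (not FLAML's exact code, just the idea): each `category`-dtype column is replaced by its integer codes.

```python
import pandas as pd

df = pd.DataFrame({"city": pd.Series(["NY", "LA", "NY"], dtype="category")})

# Roughly what the estimator wrappers do: replace each category-dtype
# column with its integer codes (an ordinal encoding).
encoded = df.apply(lambda x: x.cat.codes if x.dtype.name == "category" else x)
print(encoded["city"].tolist())  # [1, 0, 1] -- codes follow sorted category order
```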
If you want to double-check what's going on inside, you could run from source and place a breakpoint here.
Does that help?
Hello @ZmeiGorynych and @sonichi !
I just ran into the same question as @luigif2000 regarding how FLAML handles categorical features. I noticed two things here:
- for the `SKLearnEstimator` and `LGBMEstimator` estimators, the current approach is to extract the numerical codes from `category`-type pandas Series, using `apply(lambda x: x.cat.codes)` in the `_preprocess` method, as you can see here and here, respectively. At the end of the day, this is a conversion from `str` to `int` (much like an ordinal encoder), but no information about which features are actually categorical is passed to the fit parameters of these estimators. Thus, during training, categorical features are treated as regular numerical features, without leveraging the algorithms' built-in ability to handle categorical features internally.
- `CatBoostEstimator` is the only one that correctly uses the internal categorical-handling functionality, since the `cat_features` parameter is created and passed to `fit`, as you can see here.
Currently, both `LGBMClassifier` and `XGBClassifier` automatically detect and handle categorical features without the need to explicitly specify them, as long as the input is a pandas DataFrame and the columns are of the `category` dtype, as you can check in the official documentation here for LGBM and here for XGBoost (for the latter, in addition to the `category` dtype, the parameter `enable_categorical` must be set to `True`).
A possible solution would be not to convert from `category` to `int` for LGBM and XGBoost, so they can handle categorical features automatically.
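To make the proposal concrete, here is a small pandas-only sketch contrasting the two behaviors: converting to codes discards the categorical dtype, while keeping the column as-is preserves the information the learners would need for native handling.

```python
import pandas as pd

X = pd.DataFrame({"color": pd.Series(["r", "g", "b"], dtype="category")})

# Current behavior (sketch): ordinal codes -- the categorical dtype is lost,
# so downstream learners see a plain integer feature.
as_codes = X["color"].cat.codes
print(as_codes.dtype)         # int8

# Proposed behavior: keep the category dtype so LightGBM and XGBoost
# (with enable_categorical=True) can detect and split on it natively.
print(X["color"].dtype.name)  # category
```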
P.S.: `LGBMClassifier` can also take a list of feature names to identify categorical features, as CatBoost does.
Best regards,
Pedro