How to pass categorical feature names or indices to a learner

luigif2000 opened this issue 2 years ago · 5 comments

Dear all, first of all, thanks so much again for your awesome project and support.

I thought I could improve the best learner's score by passing categorical feature information to the learners that can handle categorical features. I adapted your extra-arguments example:

```python
from flaml.data import load_openml_dataset
from flaml import AutoML
import numpy as np

X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./")

cat_feat_index_list = np.where(X_train.dtypes != float)[0]
cat_feat_index_list = cat_feat_index_list.astype(np.int32)
cat_feat_names_list = X_train.iloc[:, cat_feat_index_list].columns.to_list()

automl = AutoML()
automl_settings = {
    "task": "classification",
    "time_budget": 60,
    "estimator_list": "auto",
    "fit_kwargs_by_estimator": {
        "catboost": {
            "cat_features": cat_feat_index_list,
        }
    },
}
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
```

but I get the following error:

`TypeError: catboost.core.CatBoostClassifier.fit() got multiple values for keyword argument 'cat_features'`

I tried a lot of workarounds, but nothing worked.

Could you kindly help me?

Thanks in advance.

luigi

luigif2000 · Nov 08 '22

Could you also kindly share some useful information about categorical features in FLAML? How should they be managed? How should they be passed? Thanks again.

luigi

luigif2000 · Nov 08 '22

Have you tried using a pandas DataFrame? Categorical features are supposed to be recognized by FLAML automatically when you use a DataFrame, so you don't need to pass that info to FLAML. Supporting numpy arrays with categorical features would require additional work.
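For illustration, a minimal sketch of that approach (toy data; the column names are made up):

```python
import pandas as pd
from flaml import AutoML

# Toy DataFrame: the "color" column uses the pandas "category" dtype,
# which FLAML is expected to detect on its own.
X_train = pd.DataFrame(
    {
        "color": pd.Series(["red", "blue", "red", "green"], dtype="category"),
        "size": [1.0, 2.0, 3.0, 4.0],
    }
)
y_train = pd.Series([0, 1, 0, 1])

automl = AutoML()
# No cat_features or fit_kwargs_by_estimator needed; the categorical
# columns are inferred from the DataFrame dtypes.
automl.fit(X_train=X_train, y_train=y_train, task="classification", time_budget=60)
```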

sonichi · Nov 08 '22

Thanks so much, so no problem! I asked because I use a DataFrame with many categorical features, but when FLAML converged and found XGBoost as the best model, I noticed that its enable_categorical flag was disabled. So I assumed FLAML had not recognized the categorical features.

  1. Is there a way to verify that the categorical features are recognized as real categories and used as categorical during training?
  2. What about the XGBoost enable_categorical flag? How can I set it, and how can I verify that FLAML handles it correctly?

Best regards,

luigi

luigif2000 · Nov 09 '22

Hi @luigif2000!

The enable_categorical flag is actually an experimental feature of XGBoost, and we don't support it yet. However, FLAML (including its XGBoost wrapper) handles categorical features just fine; it auto-preprocesses them, as shown here.

If you want to double-check what's going on inside, you could run from source and place a breakpoint here.

Does that help?

ZmeiGorynych · Jan 20 '23

Hello @ZmeiGorynych and @sonichi !

I just ran into the same question as @luigif2000 regarding how FLAML handles categorical features. I noticed two things here:

  1. For the SKLearnEstimator and LGBMEstimator estimators, the current approach is to extract the numerical codes from category-typed pandas series, using apply(lambda x: x.cat.codes) in the _preprocess method, as you can see here and here, respectively.

At the end of the day, this is a conversion from str to int (much like an ordinal encoder), but no information about which features are actually categorical is passed to the fit parameters of these estimators. Thus, during training, the categorical features are treated as regular numerical features, without leveraging the algorithms' built-in ability to handle categorical features properly (see the sketch after this list).

  2. CatBoostEstimator is the only one that correctly uses the internal categorical-feature handling, since the cat_features parameter is created and passed to fit, as you can see here.
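A minimal standalone sketch of the conversion described in point 1, outside of FLAML (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"color": pd.Categorical(["red", "blue", "red", "green"])})

# The ordinal-style conversion described above: each category is replaced
# by its integer code (codes follow the sorted category order).
encoded = df.apply(lambda x: x.cat.codes if x.dtype.name == "category" else x)
print(encoded["color"].tolist())  # [2, 0, 2, 1] -> blue=0, green=1, red=2
```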

Currently, both LGBMClassifier and XGBClassifier automatically detect and handle categorical features without the need to explicitly specify them, as long as the input is a pandas DataFrame and the columns are of the category dtype, as you can check in the official documentation here for LGBM and here for XGB (for the latter, in addition to the category dtype, the parameter enable_categorical must be set to True).

A possible solution would be to not convert from category to int for LGBM and XGBoost, so they can handle categorical features automatically, as sketched below.
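A sketch of that native handling, assuming reasonably recent lightgbm and xgboost versions (toy data):

```python
import pandas as pd
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

X = pd.DataFrame(
    {
        "color": pd.Series(["red", "blue", "green", "red"] * 10, dtype="category"),
        "size": list(range(40)),
    }
)
y = [0, 1, 0, 1] * 10

# LightGBM: columns with the pandas "category" dtype are picked up
# automatically, no extra parameter needed.
LGBMClassifier(n_estimators=10).fit(X, y)

# XGBoost: same idea, but enable_categorical=True is required
# (plus a histogram-based tree_method in older releases).
XGBClassifier(n_estimators=10, enable_categorical=True, tree_method="hist").fit(X, y)
```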

PS: LGBMClassifier can also take a list of feature names to identify categorical features, as CatBoost does.

Best regards,

Pedro

phgui · Nov 16 '23