
Question - Why are class imbalance parameters not considered during optimization?


I have a highly skewed dataset. I was hoping that TPOT would make use of parameters like class_weight (in LogisticRegression) and scale_pos_weight (in XGBoost), but it doesn't recognize the class distribution.

Am I missing anything here? I tried with a 90:10 ratio dataset and expected that TPOT would automatically recognize the class distribution and pick a parameter suitable to capture it. May I check why TPOT doesn't do that?

For example, below is the pipeline TPOT returned for a heavily imbalanced dataset. You can see that it did not consider the class_weight parameter (for instance, setting it to 'balanced').

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.40668814943177767
exported_pipeline = LogisticRegression()

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

You can see in the screenshot below that for the XGBClassifier it hasn't considered the scale_pos_weight parameter either. I would have expected TPOT to use that. Can you help me understand why TPOT doesn't use that parameter?

[Screenshot: TPOT's default XGBClassifier configuration, which does not include scale_pos_weight]

Ak784 avatar Jun 04 '21 12:06 Ak784

Hi @Ak784, the reason TPOT does not consider these parameters is that they are not listed under those operators in the TPOT configuration, so they are not part of the search space and are never passed to the estimator. Any parameters not included in the configuration are left at their defaults for that operator.

As an example, the configuration for XGBClassifier taken from the classifier.py dictionary is shown below. As you can see, it does not include the parameter you are looking for.

'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
        'n_jobs': [1],
        'verbosity': [0]
    }

An alternative version with the scale_pos_weight parameter set to 9 can be found below.

'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
        'n_jobs': [1],
        'verbosity': [0],
        'scale_pos_weight': [9]
    }
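
For reference, XGBoost's documentation suggests setting scale_pos_weight to roughly sum(negative instances) / sum(positive instances), so a value of 9 corresponds to the 90:10 split you described. You could also list several candidate values (e.g., [1, 5, 9, 20]) and let TPOT search over them like any other hyperparameter.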

You should be able to create your own custom configurations for the classifiers/operators that you need to add additional parameters for, including the LogisticRegression() operator. Information on this can be found here: https://epistasislab.github.io/tpot/using/#customizing-tpots-operators-and-parameters
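
As a minimal, untested sketch (the parameter values below, such as the candidate scale_pos_weight values, are illustrative assumptions rather than recommendations), a custom configuration dictionary exposing class_weight for LogisticRegression and scale_pos_weight for XGBClassifier can be passed to TPOT through the config_dict argument:

import numpy as np
from tpot import TPOTClassifier

# Custom TPOT configuration: only the operators and parameter values listed
# here will be searched; everything else is left at the estimator defaults.
tpot_config = {
    'sklearn.linear_model.LogisticRegression': {
        'penalty': ['l1', 'l2'],
        'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
        'dual': [True, False],
        # Let TPOT choose between the default and balanced class weighting
        'class_weight': [None, 'balanced'],
    },
    'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
        'n_jobs': [1],
        'verbosity': [0],
        # Candidate values for the positive-class weight (illustrative)
        'scale_pos_weight': [1, 5, 9, 20],
    },
}

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2,
                      config_dict=tpot_config)
tpot.fit(training_features, training_target)
print(tpot.score(testing_features, testing_target))
tpot.export('tpot_imbalanced_pipeline.py')

Note that restricting the configuration to these two operators also limits the search space to them; if you want the full default search space plus these parameters, start from a copy of TPOT's built-in classifier dictionary and add the extra entries to it.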

rachitk avatar Jul 01 '21 17:07 rachitk

In addition to the comment above, there is currently active development on a version of TPOT that supports imblearn (imbalanced-learn) and its related operators, which may help with imbalanced data in the future. See #547 and #1137 for more information.

rachitk avatar Jul 01 '21 17:07 rachitk