Question - Why are class imbalance parameters not considered during optimization?
I have a highly skewed dataset. I was hoping that TPOT would make use of parameters like class_weight (in LogisticRegression) and scale_pos_weight (in XGBoost), but it doesn't seem to recognize the class distribution.
Am I missing anything here? I tried a dataset with a 90:10 class ratio and expected TPOT to automatically recognize the class distribution and choose a parameter setting that accounts for it. Why doesn't TPOT do that?
For example, below is the pipeline TPOT returned for a heavily imbalanced dataset. You can see that it did not consider the class_weight parameter or set it to 'balanced'.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.40668814943177767
exported_pipeline = LogisticRegression()
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
You can see in the screenshot below that for the XGBoost classifier it has not considered the scale_pos_weight parameter. I would have expected TPOT to use it. Can you help me understand why TPOT doesn't use that parameter?
Hi @Ak784, the reason TPOT does not consider these parameters is that they are not listed under that operator in the TPOT configuration dictionary, so there is nothing for TPOT to optimize or pass in. Any parameter not included is left at its default value for that operator.
As an example, the configuration for XGBClassifier taken from the classifier.py dictionary is shown below. As you can see, it does not include the parameter you are looking for.
'xgboost.XGBClassifier': {
    'n_estimators': [100],
    'max_depth': range(1, 11),
    'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
    'subsample': np.arange(0.05, 1.01, 0.05),
    'min_child_weight': range(1, 21),
    'n_jobs': [1],
    'verbosity': [0]
}
An alternative version with the scale_pos_weight parameter set to 9 (roughly the negative-to-positive ratio of a 90:10 dataset) can be found below.
'xgboost.XGBClassifier': {
    'n_estimators': [100],
    'max_depth': range(1, 11),
    'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
    'subsample': np.arange(0.05, 1.01, 0.05),
    'min_child_weight': range(1, 21),
    'n_jobs': [1],
    'verbosity': [0],
    'scale_pos_weight': [9]
}
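The value 9 is not something TPOT computes for you; it is just the common heuristic of setting scale_pos_weight to the ratio of negative to positive samples, which comes out to roughly 9 for a 90:10 split. A quick sketch, assuming training_target holds 0/1 labels:

import numpy as np

# Ratio of negative (0) to positive (1) examples in the training labels.
neg, pos = np.bincount(np.asarray(training_target, dtype=int))
scale_pos_weight = neg / pos  # ~9.0 for a 90:10 split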
You should be able to create your own custom configuration for any classifiers/operators that you need additional parameters for, including the LogisticRegression() operator. Information on this can be found here: https://epistasislab.github.io/tpot/using/#customizing-tpots-operators-and-parameters
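For illustration, here is a minimal sketch of passing such a custom configuration through TPOT's config_dict argument. The parameter grids below (including the scale_pos_weight and class_weight values) are assumptions for this example, not TPOT's shipped defaults, and it reuses training_features/training_target from the exported script above.

import numpy as np
from tpot import TPOTClassifier

custom_config = {
    # Illustrative XGBoost grid with the imbalance parameter added.
    'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
        'n_jobs': [1],
        'verbosity': [0],
        'scale_pos_weight': [9]  # ~ negatives/positives for a 90:10 split
    },
    # Illustrative LogisticRegression grid with class reweighting enabled.
    'sklearn.linear_model.LogisticRegression': {
        'C': [1e-2, 1e-1, 0.5, 1., 5., 10.],
        'class_weight': ['balanced']
    }
}

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      config_dict=custom_config, random_state=42)
tpot.fit(training_features, training_target)
print(tpot.score(testing_features, testing_target))
tpot.export('tpot_imbalanced_pipeline.py')

With scale_pos_weight and class_weight listed explicitly, TPOT treats them like any other hyperparameter when it searches over these operators.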
In addition to the comment above, there is currently active development on a version of TPOT that supports imblearn (imbalanced-learn) and its related operators, which may help with imbalanced data in the future. See #547 and #1137 for more information.
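Until that lands, one possible workaround (not a TPOT feature, just a sketch reusing the tpot instance from the example above) is to rebalance the training split with imbalanced-learn before calling fit:

# Oversample the minority class with imbalanced-learn before handing
# the data to TPOT. SMOTE here is an illustrative choice of sampler.
from imblearn.over_sampling import SMOTE

sampler = SMOTE(random_state=42)
resampled_features, resampled_target = sampler.fit_resample(training_features, training_target)

tpot.fit(resampled_features, resampled_target)
# Evaluate on the untouched test split so the score reflects the real distribution.
print(tpot.score(testing_features, testing_target))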