imbalanced-learn
Combine SMOTENC, TomekLinks, and a classifier in a pipeline for mixed-datatype datasets
Description:
I have a dataset that contains both numeric and categorical variables, and I want to combine over-sampling and under-sampling. SMOTETomek only applies to purely numeric datasets.
Code Snippet:
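# make_pipeline must come from imblearn (not sklearn) so that samplers are accepted as steps
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import TomekLinks
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# category_cols, data_train and target_train are assumed to be defined beforehand
model_oversampler_smotenc = make_pipeline(
    SMOTENC(random_state=44, categorical_features=category_cols),
    TomekLinks(sampling_strategy='auto'),
    GradientBoostingClassifier(),
)
scoring = ['balanced_accuracy', 'f1', 'precision', 'recall']
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=3)
cv_results_oversampler_smotenc = cross_validate(
    model_oversampler_smotenc, data_train, target_train, scoring=scoring,
    return_train_score=True, return_estimator=True, cv=cv,
    n_jobs=-1,
)
print(
    "Balanced accuracy mean +/- std. dev.: "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].mean():.3f} +/- "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].std():.3f}"
)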
Questions:
- Is this the right approach? If so, can I also use other under-samplers in the pipeline?
- The code runs without errors, but what is the underlying process?
- If this logic is wrong, is there an alternative?
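
For context on the second question: a Tomek link is a pair of samples from different classes that are each other's nearest neighbor, and the TomekLinks step removes samples that take part in such pairs. The following is a minimal plain-Python sketch of that idea, not imbalanced-learn's implementation. It assumes Euclidean distance over features that are already numeric, and `find_tomek_links` is a hypothetical helper name used only for illustration.

```python
from math import dist


def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    A pair is a Tomek link when the two samples are each other's nearest
    neighbor and carry different class labels.
    """
    def nearest(i):
        # Index of the closest other sample to X[i] (Euclidean distance).
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy 1-D data: samples 1 and 2 sit close together but have different labels.
X = [(0.0,), (1.0,), (1.2,), (3.0,), (3.1,)]
y = [0, 0, 1, 1, 1]

links = find_tomek_links(X, y)

# Illustration of cleaning only the majority side of each link.
majority = max(set(y), key=y.count)
drop = {i for pair in links for i in pair if y[i] == majority}

print(links, drop)
```

In the pipeline from the snippet, this kind of cleaning would run on the output of SMOTENC, and both resampling steps are applied only while fitting on the training folds, not when predicting on the test folds.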