
Combine SMOTENC, TomekLinks and a classifier in a pipeline for mixed-datatype datasets

Open Sehjbir opened this issue 9 months ago • 0 comments

Description:

I have a dataset that contains both numeric and categorical variables, and I want to combine over-sampling and under-sampling. SMOTETomek is only applicable to purely numeric datasets.

Code Snippet:

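from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import make_pipeline  # imblearn's pipeline, which accepts samplers
from imblearn.under_sampling import TomekLinks
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# category_cols, data_train and target_train are assumed to be defined elsewhere
model_oversampler_smotenc = make_pipeline(
    SMOTENC(random_state=44, categorical_features=category_cols),
    TomekLinks(sampling_strategy='auto'),
    GradientBoostingClassifier())

scoring = ['balanced_accuracy', 'f1', 'precision', 'recall']
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=3)
cv_results_oversampler_smotenc = cross_validate(
    model_oversampler_smotenc, data_train, target_train, scoring=scoring,
    return_train_score=True, return_estimator=True, cv=cv,
    n_jobs=-1)

# Report the mean and standard deviation of the test balanced accuracy
print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].mean():.3f} +/- "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].std():.3f}"
)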

Questions:

  • Is this the right approach? If so, can I also use other under-samplers in the pipeline?
  • The code runs without any error, but I would like to understand the underlying process. What happens at each step?
  • If this logic is wrong, is there an alternative?

Sehjbir · May 22 '24 21:05