Imbalanced Learn?
I was wondering how feasible it would be to incorporate the sampling preprocessors from imbalanced-learn.
I have had a cast around the tpot code but unfortunately can't quite figure out how it hangs together enough to know how painful this would be (even just for myself hacking it in!)
If this is of interest/possible I would have a proper go at incorporating it.
I tried to use config_dict to incorporate imbalanced-learn with the code below:
# with imbalanced-learn-0.3.0.dev0
from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

ratio = {0: 10, 1: 20, 2: 30}
iris = load_iris()
X, y = make_imbalance(iris.data, iris.target, ratio=ratio)

tpot_config = {
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'imblearn.under_sampling.RandomUnderSampler': {
        'ratio': ['minority', 'majority', 'all'],
        'replacement': [True, False]
    }
}

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=3,
                      config_dict=tpot_config, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
However, I got a lot of error messages like:
All intermediate steps should be transformers and implement fit and transform.
'RandomUnderSampler(random_state=None, ratio='minority', replacement=True, return_indices=False)'
(type <class 'imblearn.under_sampling.prototype_selection.random_under_sampler.RandomUnderSampler'>) doesn't
Maybe we need to wrap the imbalanced-learn object as a subclass of sklearn.base.TransformerMixin and add an implementation of transform.
It seems that if you are using pipelines, then imbalanced-learn comes with its own implementation, imblearn.pipeline.Pipeline, which has a bunch of extra functions to do with transforming and sampling. This looks to be about supporting a different number of examples passing through a pipeline, rather than just different features. It probably only makes sense for samplers to be at the start of the pipeline too, and I'm unsure how to enforce that.
However, I have used an undersampler of imblearn with success:
'imblearn.under_sampling.TomekLinks': {
@saddy001 Could you please let me know how to use BalancedBaggingClassifier? Especially if xgboost is to be selected and tuned as the base estimator? Thanks
I guess no progress has been made? The difficulty here is that ImbLearn applies fit and sample; notice that the latter is not transform, as it does not change the features (a transformation), only resamples the rows (hence sampling).
For this reason, ImbLearn provides its own Pipeline module, as it needs to wrap the sample functionality in a way that makes sense (it only samples during training and not during testing, etc.) and is compatible with the scikit-learn API flow.
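To make the fit/sample distinction concrete, here is a minimal pure-Python sketch of a pipeline that resamples only during fit. It is a toy illustration and not imblearn's actual implementation: `ToyPipeline`, `DropEveryOther`, and `MajorityClass` are hypothetical names, and the sampler is simply assumed to expose `fit_resample(X, y)`.

```python
from collections import Counter


class DropEveryOther:
    """Hypothetical sampler: keeps every second row, so X and y shrink together."""

    def fit_resample(self, X, y):
        return X[::2], y[::2]


class MajorityClass:
    """Hypothetical final estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        self.n_seen_ = len(X)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyPipeline:
    """Applies samplers during fit only; predict passes the data through untouched."""

    def __init__(self, samplers, estimator):
        self.samplers = samplers
        self.estimator = estimator

    def fit(self, X, y):
        for sampler in self.samplers:
            # A sampler changes the number of rows, so it must return both X and y.
            X, y = sampler.fit_resample(X, y)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # No resampling here: every test row gets exactly one prediction.
        return self.estimator.predict(X)


X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
pipe = ToyPipeline([DropEveryOther()], MajorityClass()).fit(X, y)
predictions = pipe.predict(X)
```

The estimator is trained on the five resampled rows, while predict still returns one label for each of the ten input rows. A plain transform method returns only X, which is why a step that changes the row count does not fit the transformer interface.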
Since real-life data is very often highly imbalanced, I think ImbLearn compatibility is highly desirable.
Agreed, this feature would be super useful.
I'm also trying to integrate imblearn with TPOT and have made a number of code changes to try and make it happen. After making changes in what seemed like the obvious places I'm now met with an error which I'm not sure how to deal with.
Any advice would be much appreciated!
~\anaconda3\envs\env\lib\site-packages\tpot\ in _update_top_pipeline(self)
837 error_score="raise")
838 break
--> 839 raise RuntimeError('There was an error in the TPOT optimization '
840 'process. This could be because the data was '
841 'not formatted properly, or because data for '
RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation:
Code Additions/Changes
Need to add a way to check that an object is a resampler
from imblearn.over_sampling import RandomOverSampler

def _is_resampler(estimator):
    return hasattr(estimator, "fit_resample")

assert _is_resampler(RandomOverSampler)
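The same duck-typing check can be tried without imblearn installed. The sketch below uses two hypothetical stand-in classes (`StandInSampler` and `StandInTransformer`, not part of any library) to show that the `hasattr` test accepts both a class and an instance, and rejects something that only implements transform.

```python
def _is_resampler(estimator):
    # Same duck-typing test as in the thread: anything exposing fit_resample counts.
    return hasattr(estimator, "fit_resample")


class StandInSampler:
    """Hypothetical sampler exposing fit_resample."""

    def fit_resample(self, X, y):
        return X, y


class StandInTransformer:
    """Hypothetical transformer: has fit and transform but no fit_resample."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


checks = {
    "sampler class": _is_resampler(StandInSampler),
    "sampler instance": _is_resampler(StandInSampler()),
    "transformer class": _is_resampler(StandInTransformer),
}
```

Because the check only looks for an attribute name, it would also accept any unrelated object that happens to define `fit_resample`, so it is a loose filter rather than a type guarantee.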
I added _is_resampler and then used it twice within TPOTOperatorClassFactory, at line 201
if is_classifier(op_obj):
    class_profile["root"] = True
    optype = "Classifier"
elif is_regressor(op_obj):
    class_profile["root"] = True
    optype = "Regressor"
elif _is_transformer(op_obj):
    optype = "Transformer"
elif _is_selector(op_obj):
    optype = "Selector"
elif _is_resampler(op_obj):
    optype = "Resampler"
else:
    raise ValueError(
        "optype must be one of: Classifier, Regressor, Transformer, Selector, Resampler"
    )
and line 330
if inspect.isclass(doptype):  # an estimator
    if (
        issubclass(doptype, BaseEstimator)
        or is_classifier(doptype)
        or is_regressor(doptype)
        or _is_transformer(doptype)
        or _is_resampler(doptype)
        or issubclass(doptype, Kernel)
As raised by @ksyme99 the pipeline needs to be changed out for one from imblearn.
In I believe we just need to change the following line
from sklearn.pipeline import make_pipeline, make_union
to
from sklearn.pipeline import make_union
from imblearn.pipeline import make_pipeline
And in export_utils I believe we need to change
def _starting_imports(operators, operators_used):
    if num_op_root > 1:
        return {
            'sklearn.model_selection': ['train_test_split'],
            'sklearn.pipeline': ['make_pipeline', 'make_union'],
            'tpot.builtins': ['StackingEstimator'],
        }
    elif num_op > 1:
        return {
            'sklearn.model_selection': ['train_test_split'],
            'sklearn.pipeline': ['make_pipeline']
        }
to
def _starting_imports(operators, operators_used):
    if num_op_root > 1:
        return {
            'sklearn.model_selection': ['train_test_split'],
            'sklearn.pipeline': ['make_union'],
            'imblearn.pipeline': ['make_pipeline'],
            'tpot.builtins': ['StackingEstimator'],
        }
    elif num_op > 1:
        return {
            'sklearn.model_selection': ['train_test_split'],
            'imblearn.pipeline': ['make_pipeline']
        }
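To see what the changed dictionary would mean for an exported script, here is a rough sketch that turns such a module-to-names mapping into import statements. `render_imports` is a hypothetical helper written for illustration only; it is not claimed to be how TPOT's export code formats its output.

```python
def render_imports(import_map):
    """Turn {module: [names]} into sorted 'from module import names' lines."""
    lines = []
    for module in sorted(import_map):
        names = ", ".join(sorted(import_map[module]))
        lines.append(f"from {module} import {names}")
    return "\n".join(lines)


# The proposed mapping from the thread for the num_op_root > 1 branch.
proposed = {
    'sklearn.model_selection': ['train_test_split'],
    'sklearn.pipeline': ['make_union'],
    'imblearn.pipeline': ['make_pipeline'],
    'tpot.builtins': ['StackingEstimator'],
}
header = render_imports(proposed)
print(header)
```

With the proposed mapping, make_pipeline is imported from imblearn.pipeline while make_union still comes from sklearn.pipeline, so an exported pipeline containing a resampler would be built with the imblearn version.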
@AyrtonB Could you please share the link of your branch with those changes and also provide a demo to reproduce the error? I can take a look.
The error appears to be specific to a custom component I'm using which requires the index of the passed data. I have this working in imblearn, but trying to include it in TPOT was what broke it. One step at a time.
The good news is that I don't have this issue with standard imblearn components, and the following will work if you use the fork I've made here. I have just made a PR as well.
import numpy as np
from tpot import TPOTClassifier
from sklearn.datasets import make_classification

classifier_config_dict = {
    # Classifiers
    'sklearn.ensemble.ExtraTreesClassifier': {
        'n_estimators': [100],
        'criterion': ["gini", "entropy"],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'bootstrap': [True, False]
    },
    # Preprocessors
    'imblearn.over_sampling.RandomOverSampler': {
    }
}

X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

pipeline_optimizer = TPOTClassifier(generations=5, population_size=10, cv=3,
                                    random_state=42, verbosity=2, n_jobs=-1,
                                    template='RandomOverSampler-Classifier')
pipeline_optimizer.fit(X, y)
pipeline = pipeline_optimizer.fitted_pipeline_
Generation 1 - Current best internal CV score: 0.98940007916784
Generation 2 - Current best internal CV score: 0.98940007916784
Generation 3 - Current best internal CV score: 0.9896000391758383
Generation 4 - Current best internal CV score: 0.9896000391758383
Generation 5 - Current best internal CV score: 0.9897999991838367
Best pipeline: ExtraTreesClassifier(RandomOverSampler(input_matrix), bootstrap=False, criterion=gini, max_features=0.3, min_samples_leaf=1, min_samples_split=10, n_estimators=100)
Wall time: 28.4 s
Regarding the specific issue I'm encountering.
Goal: Want to be able to resample based on the group specified in a multi-index
Current progress: Custom components work fine in a standard imblearn pipeline
Current issue: The custom components break the TPOT regression optimisation
A dummy dataset can be created like this
import pandas as pd
from sklearn.datasets import make_regression

flatten = lambda t: [item for sublist in t for item in sublist]
months = flatten([[x]*100*x for x in range(1, 13)])
idx = pd.MultiIndex.from_arrays([range(len(months)), months], names=['unique', 'month'])
X, y = make_regression(n_samples=len(idx), n_features=20)
df_X, s_y = pd.DataFrame(X, index=idx), pd.Series(y, index=idx)
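For reference, the month labels in this dummy dataset are deliberately imbalanced: month x appears 100*x times. A quick stdlib-only check of the group sizes (no pandas needed; `counts`, `total_rows`, and `imbalance_ratio` are just illustrative names):

```python
from collections import Counter

flatten = lambda t: [item for sublist in t for item in sublist]
months = flatten([[x] * 100 * x for x in range(1, 13)])

counts = Counter(months)                    # rows per month label
total_rows = len(months)                    # 100 * (1 + 2 + ... + 12)
imbalance_ratio = counts[12] / counts[1]    # largest group vs smallest group
```

So the largest group (month 12) has twelve times as many rows as the smallest (month 1), which is the imbalance the custom resampler is meant to even out.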
The components are defined in a script called
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from imblearn.over_sampling import RandomOverSampler


def add_series_index(idx_arg_pos=0):
    def decorator(func):
        def decorator_wrapper(*args, **kwargs):
            input_s = args[idx_arg_pos]
            assert isinstance(input_s, (pd.Series, pd.DataFrame))
            result = pd.Series(func(*args, **kwargs), index=input_s.index)
            return result
        return decorator_wrapper
    return decorator


class PandasRandomForestRegressor(RandomForestRegressor):
    def __init__(self, n_estimators=100, *, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None, score_func=None):
        super().__init__(n_estimators=n_estimators, criterion=criterion, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, min_weight_fraction_leaf=min_weight_fraction_leaf, max_features=max_features, max_leaf_nodes=max_leaf_nodes, min_impurity_decrease=min_impurity_decrease, min_impurity_split=min_impurity_split, bootstrap=bootstrap, oob_score=oob_score, n_jobs=n_jobs, random_state=random_state, verbose=verbose, warm_start=warm_start, ccp_alpha=ccp_alpha, max_samples=max_samples)
        if score_func is None:
            self.score_func = r2_score
        else:
            self.score_func = score_func

    def predict(self, X):
        pred = super().predict(X)
        return pred

    def score(self, X, y, *args, **kwargs):
        y_pred = self.predict(X)
        score = self.score_func(y, y_pred, *args, **kwargs)
        return score


def custom_resampler_helper(X, y, class_col, resample_func):
    # Checking indexes match
    assert X.index.equals(y.index), 'X and y indexes should be the same'

    # Extracting idx names and mapping to y values
    idx_names = X.index.names
    idx_to_y = dict(zip(y.reset_index()[idx_names].apply(tuple, axis=1).values, y.values))

    # Resampling values
    classes = X.reset_index()[class_col]
    X_resampled, _ = resample_func(X.reset_index(), classes)
    y_resampled = X_resampled[idx_names].apply(tuple, axis=1).map(idx_to_y)

    # Formatting indexes
    X_resampled = X_resampled.set_index(idx_names)
    y_resampled.index = X_resampled.index
    return X_resampled, y_resampled


class XRandomOverSampler(RandomOverSampler):
    def __init__(self, class_col, sampling_strategy='auto'):
        self.class_col = class_col

    def fit(self, X):
        classes = X.reset_index()[self.class_col]
        super().fit(X, classes)

    def fit_resample(self, X, y):
        return custom_resampler_helper(X, y, self.class_col, super().fit_resample)

    def fit_sample(self, X, y):
        return self.fit_resample(X, y)
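The idea behind the custom resampler, oversampling rows so that every group in an index level ends up equally represented, can be sketched without pandas or imblearn. `oversample_by_group` is a hypothetical helper: it assumes rows are (group, features, target) tuples and samples with replacement up to the largest group's size. It is meant to illustrate the goal, not to reproduce RandomOverSampler's behaviour.

```python
import random
from collections import defaultdict


def oversample_by_group(rows, seed=0):
    """Randomly duplicate rows so every group matches the largest group's size.

    rows: list of (group, features, target) tuples. Keeping the target inside
    each row means X and y cannot drift out of alignment while resampling.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[0]].append(row)

    target_size = max(len(members) for members in by_group.values())
    resampled = []
    for group in sorted(by_group):
        members = by_group[group]
        # Draw the missing rows with replacement from the group's own members.
        extra = [rng.choice(members) for _ in range(target_size - len(members))]
        resampled.extend(members + extra)
    return resampled


# One row in group 1, three rows in group 2.
rows = [(1, [0.1], 1.0)] + [(2, [0.2 * i], 2.0 * i) for i in range(3)]
balanced = oversample_by_group(rows)
group_sizes = {g: sum(r[0] == g for r in balanced) for g in (1, 2)}
```

Carrying the target along with each row plays the same role as the index-to-y mapping in custom_resampler_helper: after resampling, each duplicated row still has its original target value.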
I then create a test pipeline like so:
import operators
from imblearn.pipeline import Pipeline

pipeline = Pipeline([
    ('xros', operators.XRandomOverSampler('month')),
    ('pandas_RF', operators.PandasRandomForestRegressor(n_estimators=100, n_jobs=-1))
])

Which works with the standard sklearn fit/predict:

pipeline.fit(df_X, s_y)
df_pred = pipeline.predict(df_X)
However it breaks with TPOT
import numpy as np
from tpot import TPOTRegressor

regressor_config_dict = {
    # Regressors
    'operators.PandasRandomForestRegressor': {
        'n_estimators': [100],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'bootstrap': [True, False]
    },
    # Preprocessors
    'operators.XRandomOverSampler': {
        'class_col': ['month']
    }
}

pipeline_optimizer = TPOTRegressor(generations=5, population_size=10, cv=3,
                                   random_state=42, verbosity=2, n_jobs=-1,
                                   template='XRandomOverSampler-PandasRandomForestRegressor')
pipeline_optimizer.fit(df_X, s_y)
For which I get this error
RuntimeError Traceback (most recent call last)
c:\path\to\tpot\tpot\ in fit(self, features, target, sample_weight, groups)
742 per_generation_function=self._check_periodic_pipeline,
--> 743 log_file=self.log_file_
744 )
c:\path\to\tpot\tpot\ in eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar, stats, halloffame, verbose, per_generation_function, log_file)
280 if per_generation_function is not None:
--> 281 per_generation_function(gen)
c:\path\to\tpot\tpot\ in _check_periodic_pipeline(self, gen)
1052 """
-> 1053 self._update_top_pipeline()
1054 if self.periodic_checkpoint_folder is not None:
c:\path\to\tpot\tpot\ in _update_top_pipeline(self)
838 break
--> 839 raise RuntimeError('There was an error in the TPOT optimization '
840 'process. This could be because the data was '
RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation: