Run custom pipeline / feature extraction on each cv fold

Open MaxBenChrist opened this issue 3 years ago • 2 comments

Hi tpot team,

Thank you for this great library. I am currently using it on a few datasets and the results are great (especially if you tune the config to your problem).

Now, for one dataset, I need a feature extraction step whose output depends on the samples in the training set. I have a custom pipeline to extract those features. In the following example, that feature extraction pipeline is pipe. The individual steps are not important; they are custom transformers.

pipe = Pipeline([
    ('last_foo', AddLastFromGroup()),
    ('last_bar', AddLastFromGroup()),
    ('missing_indicator', AddMissingIndicator()),
    ('imputer_groups', GroupedImputer()),
    ('imputer_median', MeanMedianImputer()),
    ('imputer_categories', CategoricalImputer()),
    ('foo', DropFeatures()),
    ('baz', RareLabelEncoder()),
    ('bu', OneHotEncoder()),
])

X_train = pipe.fit_transform(df_train)
X_test = pipe.transform(df_test)

tpot = TPOTRegressor(cv=10)
tpot.fit(X_train, y_train)

I pass the features computed by pipe to tpot. However, when tpot runs cross-validation inside tpot.fit(X_train, y_train), data leakage occurs: the features were calculated on the whole training set df_train, so they already contain information from the samples in each CV test fold. This is a problem for me because it overestimates the importance of certain features.

So, how can I run the pipe to create the features in each of the 10 cross-validation train and test folds inside tpot? Essentially, I want every tpot pipeline to start with pipe and then extend it by estimators, selectors and regressors. In that case, I would call tpot.fit(df_train, y_train) instead of tpot.fit(X_train, y_train).
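To make the requested behavior concrete, here is a minimal sketch in plain Python. It is not TPOT code: MeanImputer is a hypothetical stand-in for pipe, and k_fold_indices and transform_per_fold are illustrative helpers. The point is only that the transformer is refit on each CV train fold and never sees the corresponding test fold.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical stand-in for `pipe`: learns a mean, fills None with it."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.mean_ = mean(observed)
        return self

    def transform(self, values):
        return [self.mean_ if v is None else v for v in values]


def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def transform_per_fold(values, n_splits):
    """Fit on each train fold only, then transform train and test parts."""
    results = []
    for train_idx, test_idx in k_fold_indices(len(values), n_splits):
        train = [values[i] for i in train_idx]
        test = [values[i] for i in test_idx]
        imputer = MeanImputer().fit(train)  # the test fold is never seen here
        results.append(
            (imputer.transform(train), imputer.transform(test), imputer.mean_)
        )
    return results


column = [1.0, None, 3.0, 100.0, None, 5.0]

# Leaky variant: statistic learned once on the whole training set.
leaky_mean = MeanImputer().fit(column).mean_

# Leakage-free variant: one statistic per CV train fold.
folds = transform_per_fold(column, n_splits=3)
fold_means = [fold_mean for _, _, fold_mean in folds]

print("mean fitted on all rows:", leaky_mean)
print("means fitted per train fold:", fold_means)
```

The per-fold means differ from the single mean fitted on all rows, which is exactly the information that leaks when the features are computed before cross-validation. In scikit-learn, putting the transformers and the estimator into one Pipeline and passing that pipeline to cross_val_score gives this per-fold behavior for a fixed model; the open question here is how to get the same thing for every pipeline that tpot generates.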

I was thinking about using the template argument and looked into the tpot source code, but I am a little lost, as this part is not well documented. I guess you would have to somehow fix my pipe as the root of the tree in https://github.com/EpistasisLab/tpot/blob/master/tpot/base.py#L444:L508?

Finally, where can I find a description of the genetic algorithm that is used in tpot?

MaxBenChrist, Mar 04 '21 21:03

Any update on this? Is the description clear?

MaxBenChrist, Apr 13 '21 23:04

I'd like an update on this as well. Some kind of data preprocessing pipeline on each fold would be great.

ianbenlolo, Apr 28 '21 14:04