tpot
Run custom pipeline / feature extraction on each cv fold
Hi tpot team,
Thank you for this great library. I am currently using it on a few datasets and the results are great (especially if you tune the config to your problem).
Now, for one data set, I need to perform a feature extraction that is sensitive to the samples in the train set. I have a custom pipeline to extract those features; in the following example it is called pipe. The individual steps of that pipeline are not important; they are custom transformers.
from sklearn.pipeline import Pipeline
from tpot import TPOTRegressor

# The transformers below are custom classes defined elsewhere.
pipe = Pipeline([
    ('last_foo', AddLastFromGroup()),
    ('last_bar', AddLastFromGroup()),
    ('missing_indicator', AddMissingIndicator()),
    ('imputer_groups', GroupedImputer()),
    ('imputer_median', MeanMedianImputer()),
    ('imputer_categories', CategoricalImputer()),
    ('foo', DropFeatures()),
    ('baz', RareLabelEncoder()),
    ('bu', OneHotEncoder()),
])

# Features are fitted once on the whole train set, before any cv split.
X_train = pipe.fit_transform(df_train)
X_test = pipe.transform(df_test)

tpot = TPOTRegressor(cv=10)
tpot.fit(X_train, y_train)
I then pass the features calculated by pipe to tpot. However, when tpot runs its cross-validation inside tpot.fit(X_train, y_train), this causes data leakage: the features were calculated on the whole train set df_train, so the features in each cv train fold already contain information from the samples in the corresponding cv test fold. This is a problem for me because it overestimates the importance of certain features.
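To make the leakage concrete, here is a minimal sketch in plain scikit-learn, without tpot. It does not use my real transformers: SelectKBest is swapped in as a stand-in for a fitted feature step, and the data is pure random noise, both assumptions chosen only to exaggerate the effect. Fitting the step on the whole train set before cross-validation should give a much more optimistic score than refitting it inside each fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))  # pure noise features
y = rng.normal(size=200)          # target is independent of X

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Leaky: the feature step is fitted on all samples, including every cv test fold.
X_leaky = SelectKBest(f_regression, k=20).fit_transform(X, y)
leaky_r2 = cross_val_score(LinearRegression(), X_leaky, y, cv=cv, scoring="r2").mean()

# Leak-free: the feature step is part of the pipeline, so it is refitted
# on the train part of each fold only.
honest = Pipeline([
    ("select", SelectKBest(f_regression, k=20)),
    ("model", LinearRegression()),
])
honest_r2 = cross_val_score(honest, X, y, cv=cv, scoring="r2").mean()

print(f"leaky R^2:     {leaky_r2:.3f}")
print(f"leak-free R^2: {honest_r2:.3f}")
```

Since there is no real signal in this data, any clearly positive score in the leaky variant comes from the leakage alone.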
So, how can I run pipe so that the features are created separately for each of the 10 cross-validation train and test folds inside tpot? Essentially, I want every tpot pipeline to start with pipe and then extend it with estimators, selectors and regressors. In that case, I would call tpot.fit(df_train, y_train) instead of tpot.fit(X_train, y_train).
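Written out by hand, the per-fold behaviour I am after would look roughly like the sketch below. This is not tpot code and says nothing about how tpot works internally. The data is synthetic, and a single SimpleImputer step plus a Ridge model are hypothetical stand-ins for my custom transformers and for whatever tpot would evolve.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for df_train / y_train.
rng = np.random.default_rng(0)
df_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_train = df_train["a"] * 2.0 + rng.normal(scale=0.1, size=100)
df_train.iloc[::7, 1] = np.nan  # some missing values for the imputer

# Stand-in for the custom feature extraction pipeline.
pipe = Pipeline([("imputer_median", SimpleImputer(strategy="median"))])

fold_mse = []
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(df_train):
    fold_pipe = clone(pipe)  # fresh, unfitted copy for every fold
    # Fit the feature extraction on the train part of the fold only ...
    X_tr = fold_pipe.fit_transform(df_train.iloc[train_idx])
    # ... and only apply it to the held-out part.
    X_te = fold_pipe.transform(df_train.iloc[test_idx])
    model = Ridge().fit(X_tr, y_train.iloc[train_idx])
    fold_mse.append(mean_squared_error(y_train.iloc[test_idx], model.predict(X_te)))

print(f"mean MSE over {len(fold_mse)} folds: {np.mean(fold_mse):.4f}")
```

The point is only that pipe is refitted inside every fold, which is what I would like tpot to do for me before it appends its own steps.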
I was thinking about using the template argument and looked into the tpot source code, but I am a little bit lost, and unfortunately this part is not well documented. I guess you would have to somehow fix my pipe as the root of the tree in https://github.com/EpistasisLab/tpot/blob/master/tpot/base.py#L444:L508?
Finally, where can I find a description of the genetic algorithm that is used in tpot?
Any update on this? Is the description clear?
I'd like an update on this as well. Some kind of data preprocessing pipeline on each fold would be great.