sklearn-pandas

Add an option to DataFrameMapper to add missing columns

Open gsmafra opened this issue 7 years ago • 4 comments

I am currently working on a workflow where we convert database records directly to a pandas DataFrame and then apply ML algorithms to it with the help of sklearn-pandas. However, the records sometimes lack some of the features used for prediction, so I have to add those columns to the DataFrame. For that I wrote a custom transformer that is applied before DataFrameMapper:

from sklearn.base import BaseEstimator, TransformerMixin


class ColumnInserter(BaseEstimator, TransformerMixin):
    """Add columns seen during fit that are missing at transform time."""

    def __init__(self):
        self.columns = []

    def fit(self, df, y=None):
        # remember the columns present in the training DataFrame
        self.columns = list(df.columns)
        return self

    def transform(self, df):
        df_new = df.copy()

        # insert missing columns, filled with None, in the order seen during fit
        missing_cols = [col for col in self.columns if col not in df.columns]
        for col in missing_cols:
            df_new[col] = None

        return df_new

Maybe this kind of feature would also be useful to others if it were part of sklearn-pandas itself, probably using the columns specified in the features parameter.

gsmafra, Jul 07 '17 14:07

I might add an option to the DataFrameMapper.__init__ called missing_features.

This parameter would have 2 options:

  • 'raise' (default). Raise an error if some feature is missing (current behaviour).
  • 'add'. Fill the missing feature with None or NaN and pass it to the transformers.
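The two options above could behave roughly as in the sketch below. This is only an illustration written as a standalone function: `handle_missing_features` and its arguments are hypothetical names, not part of sklearn-pandas, and the choice of NaN as the fill value and `KeyError` as the error type are assumptions.

```python
import numpy as np
import pandas as pd


def handle_missing_features(df, feature_columns, missing_features="raise"):
    """Hypothetical helper (not part of sklearn-pandas) mimicking the proposal."""
    if missing_features not in ("raise", "add"):
        raise ValueError("missing_features must be 'raise' or 'add'")

    # columns the mapper expects but the DataFrame does not provide
    missing = [col for col in feature_columns if col not in df.columns]
    if not missing:
        return df

    if missing_features == "raise":
        raise KeyError("Missing feature columns: {}".format(missing))

    # 'add': return a copy with the missing columns filled with NaN
    df_new = df.copy()
    for col in missing:
        df_new[col] = np.nan
    return df_new


records = pd.DataFrame({"age": [31, 45]})
filled = handle_missing_features(records, ["age", "income"], missing_features="add")
```

With `missing_features="raise"` the same call would raise instead of returning a frame, which matches the current behaviour described above.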

What do you think?

arnau126, Jul 12 '17 11:07

@arnau126 I can't think of any other options we would want in the future, so we could just as well make it a boolean, couldn't we? The most intuitive name would probably be insert_missing_features or add_missing_features, though I don't know if that looks too long.

gsmafra, Jul 12 '17 12:07

I believe this functionality, if implemented, would be better as a component outside of the DataFrameMapper, to avoid overloading that class with overly complex custom behaviour - it's already quite complex, with lots of options.

I see it more as a kind of "column imputer" transformer. I'm good with adding this transformer as part of the package if @arnau126 agrees as well. Then we would need a PR with some extra documentation advertising this feature.
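A minimal sketch of what such a standalone "column imputer" might look like follows. `MissingColumnImputer`, its `columns` and `fill_value` parameters, and the example data are hypothetical and not an existing sklearn-pandas API. For the sake of a self-contained example, scikit-learn's `SimpleImputer` stands in for whatever step comes next in the pipeline; in the workflow discussed here that step would be the DataFrameMapper.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class MissingColumnImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: add any expected column that is absent."""

    def __init__(self, columns, fill_value=np.nan):
        self.columns = columns
        self.fill_value = fill_value

    def fit(self, df, y=None):
        # nothing to learn: the expected columns are given explicitly
        return self

    def transform(self, df):
        df_new = df.copy()
        for col in self.columns:
            if col not in df_new.columns:
                df_new[col] = self.fill_value
        return df_new


train = pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 30.0]})
test = pd.DataFrame({"a": [3.0]})  # column "b" is missing here

pipeline = Pipeline([
    ("add_missing", MissingColumnImputer(columns=["a", "b"])),
    ("impute", SimpleImputer(strategy="mean")),  # stand-in for later steps
])
pipeline.fit(train)
result = pipeline.transform(test)
```

Keeping the column handling in its own step means the mapper itself needs no new options, which is the trade-off argued for above.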

Thanks @gsmafra !

dukebody, Jul 29 '17 08:07

I think you can incorporate this directly into a DataFrameMapper (since you can select columns multiple times). Otherwise you might want to use a FeatureUnion (a short implementation for data frames can be found here).

datajanko, Feb 02 '18 18:02