sklearn-pandas
Add an option to DataFrameMapper to add missing columns
I am currently working on a workflow where we convert database records directly to a pandas DataFrame and then apply ML algorithms to it with the help of sklearn-pandas. However, sometimes these records don't have all the features used for prediction, and I have to add the missing columns to the DataFrame. For that I wrote a custom transformer to be applied before DataFrameMapper:
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnInserter(BaseEstimator, TransformerMixin):
    """Add columns seen during fit that are missing at transform time."""

    def __init__(self):
        self.columns = []

    def fit(self, df, y=None):
        # remember the columns of the training DataFrame
        self.columns = list(df.columns)
        return self

    def transform(self, df):
        df_new = df.copy()
        # insert missing columns, in the order seen during fit
        missing_cols = [col for col in self.columns if col not in df_new.columns]
        for col in missing_cols:
            df_new[col] = None
        return df_new
Maybe it would also be useful to others to have this kind of feature in sklearn-pandas itself, probably using the columns specified in the features parameter.
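For comparison, when the expected column list is already known, plain pandas can do something similar with DataFrame.reindex. This is only a sketch: the column names and data are made up for illustration, and unlike the transformer above, reindex also drops columns that are not in the list and reorders the rest.

```python
import pandas as pd

# Columns seen at training time (illustrative names, not from this thread).
train_columns = ["age", "income", "city"]

# Prediction-time records that lack "income" and carry an extra column.
df = pd.DataFrame({"city": ["Paris", "Lyon"], "age": [31, 45], "extra": [1, 2]})

# reindex adds the missing column filled with NaN, drops "extra",
# and returns the columns in the training order.
aligned = df.reindex(columns=train_columns)

print(list(aligned.columns))
print(aligned["income"].isna().all())
```

Whether dropping the extra columns is acceptable depends on what the downstream mapper selects.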
I might add an option to DataFrameMapper.__init__ called missing_features.
This parameter would have 2 options:
- 'raise' (default). Raise an error if some feature is missing (current behaviour).
- 'add'. Fill the missing feature with None or NaN and pass it to the transformers.
What do you think?
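To make the two proposed modes concrete, here is a rough sketch of the semantics as a standalone function. The helper name select_features and its signature are hypothetical. This is not existing sklearn-pandas API, only an illustration of what 'raise' and 'add' could mean.

```python
import numpy as np
import pandas as pd


def select_features(df, features, missing_features="raise"):
    """Hypothetical helper, not part of sklearn-pandas."""
    if missing_features not in ("raise", "add"):
        raise ValueError("missing_features must be 'raise' or 'add'")
    # features that the DataFrame does not provide, in the requested order
    missing = [col for col in features if col not in df.columns]
    if missing and missing_features == "raise":
        raise KeyError(f"missing features: {missing}")
    # 'add' mode: fill each missing feature with NaN on a copy
    out = df.copy()
    for col in missing:
        out[col] = np.nan
    return out


df = pd.DataFrame({"age": [31, 45]})
filled = select_features(df, ["age", "income"], missing_features="add")
```

With the default 'raise', the same call fails with a KeyError, which mirrors the current behaviour described above.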
@arnau126 I can't think of any other options we would need in the future, so we might as well make it a boolean, couldn't we? The most intuitive name would probably be insert_missing_features or add_missing_features, though I don't know if that looks too long.
I believe this functionality, if implemented, would better be a component outside of the DataFrameMapper, to avoid overloading this class with too complex custom behaviour - it's already quite complex, with lots of options.
I see it more as a kind of "column imputer" transformer. I'm good with adding this transformer as part of the package if @arnau126 agrees as well. Then we would need a PR with some extra documentation advertising this feature.
Thanks @gsmafra !
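A standalone "column imputer" along those lines might look roughly like the sketch below. The class name ColumnImputer and its columns and fill_value parameters are assumptions made for illustration, and a FunctionTransformer stands in for the DataFrameMapper step so the example only needs pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class ColumnImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds expected columns that are absent."""

    def __init__(self, columns, fill_value=np.nan):
        # store parameters unchanged, as scikit-learn estimators expect
        self.columns = columns
        self.fill_value = fill_value

    def fit(self, X, y=None):
        # stateless: the expected columns are given up front
        return self

    def transform(self, X):
        X_new = X.copy()
        for col in self.columns:
            if col not in X_new.columns:
                X_new[col] = self.fill_value
        return X_new


pipe = Pipeline([
    ("add_missing", ColumnImputer(columns=["age", "income"])),
    # stand-in for a DataFrameMapper step that selects the features
    ("select", FunctionTransformer(lambda df: df[["age", "income"]].to_numpy())),
])

X = pipe.fit_transform(pd.DataFrame({"age": [31, 45]}))
```

Taking the column list in the constructor, rather than learning it in fit, keeps the step independent of whatever frame happens to be used for training.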
I think you can incorporate this directly into a DataFrameMapper (since you can select columns multiple times). Otherwise you might want to do a Feature Union (a short implementation for data frames can be found here