sklearn-pandas

Unexpected Dropping of columns

Open minorchange opened this issue 2 years ago • 2 comments

In the following example, the printed output does not change when the line drop_cols=["salary"] is commented out:

import sklearn.preprocessing
import pandas as pd
import sklearn_pandas


data = pd.DataFrame(
    {
        "pet": ["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
        "children": [4.0, 6, 3, 3, 2, 3, 5, 4],
        "salary": [90.0, 24, 44, 27, 32, 59, 36, 27],
    }
)

mapper = sklearn_pandas.DataFrameMapper(
    [
        ("pet", sklearn.preprocessing.LabelBinarizer()),
        (["children"], sklearn.preprocessing.StandardScaler()),
    ],
    input_df=True,
    df_out=True,
    drop_cols=["salary"],
)

print(data)
print()
print(mapper.fit_transform(data.copy()))

With or without that line, the transformed dataframe contains no salary column. I would have expected columns that are not mentioned in the mapper to be left untouched, especially since the drop_cols option exists.

Is this just me having arbitrary expectations or is there something strange going on?

minorchange, Jul 29 '22 11:07

I modified the _build(self, X=None) method of the DataFrameMapper class, adding code that filters the feature columns based on self.drop_cols.

Original _build method:

    def _build(self, X=None):
        """
        Build attributes built_features and built_default.
        """
        if isinstance(self.features, list):
            self.built_features = [
                _build_feature(*f, X=X) for f in self.features
            ]
        else:
            self.built_features = _build_feature(*self.features, X=X)
        self.built_default = _build_transformer(self.default)

Modified code:

    def _build(self, X=None):
        """
        Build attributes built_features and built_default.
        """
        if isinstance(self.features, list):
            # Remove every column listed in self.drop_cols from the features.
            filtered_list = []
            for obj in self.features:
                if isinstance(obj[0], list):
                    # Multi-column feature: keep only the columns not dropped.
                    new_cols = [col for col in obj[0] if col not in self.drop_cols]
                    new_tuple = tuple([new_cols] + list(obj[1:]))
                    filtered_list.append(new_tuple)
                else:
                    # Single-column feature: keep it unless it is dropped.
                    if obj[0] not in self.drop_cols:
                        filtered_list.append(obj)
            self.features = filtered_list

            self.built_features = [_build_feature(*f, X=X) for f in self.features]
        else:
            self.built_features = _build_feature(*self.features, X=X)
        self.built_default = _build_transformer(self.default)
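
The filtering step above can also be pulled out into a standalone helper, which makes it easier to check without a DataFrameMapper instance. The following is only a sketch under stated assumptions: filter_features is a hypothetical name, not part of sklearn-pandas, and the placeholder strings stand in for real transformers. It differs from the method above in two ways that may or may not be wanted: it returns a new list rather than reassigning self.features, and it skips list-style entries whose columns are all dropped instead of keeping them with an empty column list.

```python
def filter_features(features, drop_cols):
    """Return a copy of `features` without any column listed in `drop_cols`.

    Hypothetical helper (not part of sklearn-pandas). Each feature is a
    tuple whose first element is a column name or a list of column names.
    """
    drop = set(drop_cols or [])
    filtered = []
    for feature in features:
        columns, rest = feature[0], tuple(feature[1:])
        if isinstance(columns, list):
            kept = [col for col in columns if col not in drop]
            if kept:  # skip entries whose columns were all dropped
                filtered.append((kept, *rest))
        elif columns not in drop:
            filtered.append(feature)
    return filtered


# Placeholder strings stand in for transformer objects (assumption).
features = [
    ("pet", "binarizer"),
    (["children", "salary"], "scaler"),
    (["salary"], "scaler"),
]
result = filter_features(features, drop_cols=["salary"])
print(result)  # [('pet', 'binarizer'), (['children'], 'scaler')]
```

Because the helper does not modify its input, the original features list still has three entries after the call.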

Any feedback or suggestions on my code changes would be greatly appreciated. Thank you!

namanmistry, May 14 '23 07:05

Hello, I have received your message. Thank you.

hu-minghao, May 14 '23 07:05