Imputer modifies user data when user passes in a DataTable

Open freddyaboulton opened this issue 3 years ago • 3 comments

import woodwork as ww
import pandas as pd
import numpy as np
from evalml.pipelines.components import Imputer

# Build a DataTable in which every column is entirely null.
df = ww.DataTable(pd.DataFrame({
    "all nan": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "all nan cat": pd.Series([np.nan, np.nan, np.nan, np.nan, np.nan], dtype='category')
}))
X = Imputer().fit_transform(df)

# This passes: the user's original DataTable has lost its columns.
assert df.to_dataframe().empty

This came up during #2018. The imputer is expected to drop all-null columns but, as a user, I wouldn't expect the Imputer to modify the data passed in.

The underlying issue is that infer_feature_types does not copy the data when users pass in a data table.
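To make the aliasing problem concrete, here is a minimal sketch using plain pandas rather than woodwork or evalml. The two helper functions are hypothetical stand-ins, not the actual Imputer or infer_feature_types code; they only contrast dropping all-null columns on the caller's object with dropping them on a copy.

```python
import pandas as pd


def make_frame() -> pd.DataFrame:
    # Two fully-null columns plus one ordinary column (added for illustration).
    return pd.DataFrame({
        "all nan": [float("nan")] * 5,
        "all nan cat": pd.Series([float("nan")] * 5, dtype="category"),
        "ok": [1, 2, 3, 4, 5],
    })


def drop_all_null_in_place(X: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transform that works directly on the caller's object.
    all_null = [col for col in X.columns if X[col].isna().all()]
    X.drop(columns=all_null, inplace=True)
    return X


def drop_all_null_with_copy(X: pd.DataFrame) -> pd.DataFrame:
    # Same logic, but copy at the boundary so the caller's data is untouched.
    X = X.copy()
    all_null = [col for col in X.columns if X[col].isna().all()]
    X.drop(columns=all_null, inplace=True)
    return X


mutated = make_frame()
drop_all_null_in_place(mutated)
mutated_columns = list(mutated.columns)      # caller's frame lost columns

preserved = make_frame()
result = drop_all_null_with_copy(preserved)
preserved_columns = list(preserved.columns)  # caller's frame is intact
result_columns = list(result.columns)
```

The copy costs extra memory proportional to the input, which is the trade-off discussed in the comments.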

freddyaboulton · Mar 29 '21 19:03

I would caution against copying the user's data. If the user has a large dataset, the copy might be expensive.

In Featuretools, we modify the user's dataframe when it is passed to an Entity.

gsheni · Mar 30 '21 02:03

Let's do the copy for now. I agree this has performance implications, but it's important to keep our API contract clear.

This only happens when one or more columns are fully NaN, so let's treat it as low priority.
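One possible middle ground between the two positions, sketched here in plain pandas as an assumption and not as what evalml does: create a new frame only when there is actually a fully-null column to drop. The function name is hypothetical.

```python
import pandas as pd


def drop_all_null_lazy(X: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: find the columns that are entirely null.
    to_drop = [col for col in X.columns if X[col].isna().all()]
    if not to_drop:
        # Nothing to remove, so return the input as-is and avoid any copy.
        return X
    # DataFrame.drop without inplace=True returns a new frame; X is unchanged.
    return X.drop(columns=to_drop)


clean = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, float("nan")]})
dirty = pd.DataFrame({"a": [1.0, 2.0], "empty": [float("nan"), float("nan")]})

same_object = drop_all_null_lazy(clean) is clean
dropped = drop_all_null_lazy(dirty)
dirty_columns_after = list(dirty.columns)
dropped_columns = list(dropped.columns)
```

The caveat is that when nothing is dropped, the result is the very same object as the input, so any later in-place change to the result would still reach the user's data.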

dsherry · Apr 01 '21 17:04

Since https://github.com/alteryx/evalml/issues/2751 was merged in, can we close out this issue?

gsheni · Oct 13 '21 20:10