SimpleImputer can raise `TypeConversionError` if `mean` or `median` strategy used with boolean data
The following code attempts to use the `mean` and `median` strategies with boolean data. The imputer converts the values to floats and then imputes the mean or median of the data, which may very well be a floating point value that cannot be converted back to BooleanNullable, as the SimpleImputer currently attempts to do. Note that this is not currently reachable from AutoMLSearch, as the `Imputer` component keeps this from happening.
import pandas as pd
import pytest
import woodwork as ww
from evalml.pipelines.components import SimpleImputer

for strategy in ["mean", "median"]:
    X_train = pd.DataFrame(
        {
            "fully_bool": pd.Series([True, False, True, True, True]),
            "one_nan": pd.Series([True, False, pd.NA, False, True]),
        },
    )
    X_train.ww.init(
        logical_types={
            "fully_bool": "Boolean",
            "one_nan": "BooleanNullable",
        },
    )
    imp = SimpleImputer(
        impute_strategy=strategy,
    )
    imp.fit(X_train)
    # The imputed value is a float that cannot be cast back to a boolean type.
    with pytest.raises(
        ww.exceptions.TypeConversionError,
        match="Error converting datatype for one_nan from type object to type boolean.",
    ):
        imp.transform(X_train)
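To see why the conversion back fails, the arithmetic for the `one_nan` column can be worked by hand. This is a minimal sketch using only the standard library, not the imputer's actual code path. It assumes missing values are skipped when the statistic is computed and that booleans are cast to 0.0 and 1.0; `is_boolean_like` is a hypothetical helper for illustration.

```python
from statistics import mean, median

# Non-missing values of the "one_nan" column from the example above.
observed = [True, False, False, True]

# Assumption: booleans are cast to floats before the statistic is computed.
as_floats = [float(v) for v in observed]

fill_mean = mean(as_floats)
fill_median = median(as_floats)


def is_boolean_like(value):
    """Hypothetical check: can this float be read back as True or False?"""
    return value in (0.0, 1.0)


print(fill_mean, is_boolean_like(fill_mean))
print(fill_median, is_boolean_like(fill_median))
```

With two `True` and two `False` values, both statistics land halfway between the two boolean values, so there is no boolean the fill value could map back to.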
We should handle this situation. We have several options for how to do this:
- Explicitly disallow the "mean" and "median" strategies for boolean values in the simple imputer. This would require adding logic that is, I assume, the reason we have a separate `Imputer` component in the first place.
- Implicitly disallow the "mean" and "median" strategies for boolean data in the simple imputer, and note the limitations in the docstring. This might also be a good time to make it clearer that this component expects all columns to be of the same type.
- Change those columns' types to Double in the `new_schema` prior to initializing woodwork, like we do with IntegerNullable to Double. This doesn't make much sense to me, as it implies a continuous relationship between boolean values, but if there's a use case for this that I'm missing, we can consider it.
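As a rough illustration of the first option, an explicit check could look something like the sketch below. Everything here is hypothetical: `validate_strategy` and the two constant sets are not part of evalml, and the sketch assumes the logical types are available as a plain mapping from column name to logical type name rather than as a woodwork schema.

```python
# Hypothetical names throughout; none of this is evalml's API.
NUMERIC_ONLY_STRATEGIES = {"mean", "median"}
BOOLEAN_LOGICAL_TYPES = {"Boolean", "BooleanNullable"}


def validate_strategy(impute_strategy, logical_types):
    """Raise ValueError if a numeric-only strategy is paired with boolean columns.

    logical_types is assumed to map column name -> logical type name (str).
    """
    if impute_strategy not in NUMERIC_ONLY_STRATEGIES:
        return
    bool_cols = sorted(
        col
        for col, ltype in logical_types.items()
        if ltype in BOOLEAN_LOGICAL_TYPES
    )
    if bool_cols:
        raise ValueError(
            f"Cannot use impute_strategy={impute_strategy!r} "
            f"with boolean columns: {bool_cols}"
        )


# Usage: the column types from the reproduction above.
types = {"fully_bool": "Boolean", "one_nan": "BooleanNullable"}
validate_strategy("most_frequent", types)  # passes silently
try:
    validate_strategy("mean", types)
except ValueError as err:
    print(err)
```

Failing early in this way would replace the late `TypeConversionError` with a message that names the offending columns, at the cost of duplicating some of the type-aware logic that `Imputer` already carries.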
We should also think about this with the `TargetImputer`, which would have this same problem.