SimpleImputer can raise `TypeConversionError` if the `mean` or `median` strategy is used with boolean data

Open tamargrey opened this issue 1 year ago • 1 comments

The following code uses the mean and median strategies with boolean data. The imputer converts the values to floats and then imputes the mean or median of the data, which may well be a floating point value that cannot be converted back to BooleanNullable, as the SimpleImputer currently attempts to do. Note that this is not currently reachable from AutoMLSearch, because the Imputer component keeps it from happening.

    import pandas as pd
    import pytest
    import woodwork as ww
    from evalml.pipelines.components import SimpleImputer

    for strategy in ["mean", "median"]:
        # One fully populated boolean column and one with a missing value.
        X_train = pd.DataFrame(
            {
                "fully_bool": pd.Series([True, False, True, True, True]),
                "one_nan": pd.Series([True, False, pd.NA, False, True]),
            },
        )
        X_train.ww.init(
            logical_types={
                "fully_bool": "Boolean",
                "one_nan": "BooleanNullable",
            },
        )

        imp = SimpleImputer(
            impute_strategy=strategy,
        )
        imp.fit(X_train)
        # The imputed float cannot be converted back to a boolean type.
        with pytest.raises(
            ww.exceptions.TypeConversionError,
            match="Error converting datatype for one_nan from type object to type boolean.",
        ):
            imp.transform(X_train)
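
To make the arithmetic concrete, here is a minimal standard-library sketch of what the mean and median work out to for the non-missing values of `one_nan`. It does not use evalml or woodwork, and `is_valid_boolean_fill` is a hypothetical helper made up for illustration, not part of either library.

```python
import statistics

# Non-missing values of the "one_nan" column from the example above.
observed = [True, False, False, True]

# The mean and median strategies operate on the float form of the booleans.
as_floats = [float(v) for v in observed]
fill_mean = statistics.mean(as_floats)
fill_median = statistics.median(as_floats)


def is_valid_boolean_fill(value):
    """Hypothetical check: only 0.0 and 1.0 map cleanly back to a boolean."""
    return value in (0.0, 1.0)


print("mean:", fill_mean, "valid boolean fill:", is_valid_boolean_fill(fill_mean))
print("median:", fill_median, "valid boolean fill:", is_valid_boolean_fill(fill_median))
```

Both fill values come out to 0.5 here, which has no boolean equivalent, so converting the imputed column back to a boolean type cannot succeed.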

We should handle this situation. We have several options for how to do this:

  • Explicitly disallow "mean" and "median" strategies for boolean values in the simple imputer. This would require adding type-checking logic, which is, I assume, the reason we have a separate Imputer component in the first place.
  • Implicitly disallow "mean" and "median" strategies for boolean data in the simple imputer by noting the limitation in the docstring. This might also be a good time to make it clearer that this component expects all columns to be of the same type.
  • Change those columns' types to Double in the new_schema prior to initializing woodwork, as we do when converting IntegerNullable to Double. This doesn't make much sense to me, as it implies a continuous relationship between boolean values, but if there's a use case I'm missing, we can consider it.
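
As a rough illustration of the first option, an explicit check could look something like the sketch below. Everything in it is an assumption made for illustration: `validate_impute_strategy`, the two constant sets, and the idea of passing logical types as a plain dict of column name to type name are hypothetical and are not existing evalml or woodwork APIs.

```python
# Hypothetical names, for illustration only; none of these exist in evalml.
NUMERIC_ONLY_STRATEGIES = {"mean", "median"}
BOOLEAN_TYPES = {"Boolean", "BooleanNullable"}


def validate_impute_strategy(strategy, logical_types):
    """Raise ValueError if a numeric-only strategy is used with boolean columns.

    `logical_types` maps column names to logical type names as strings.
    """
    if strategy not in NUMERIC_ONLY_STRATEGIES:
        return
    offending = sorted(
        col for col, ltype in logical_types.items() if ltype in BOOLEAN_TYPES
    )
    if offending:
        raise ValueError(
            f"Cannot use the {strategy!r} strategy with boolean columns: {offending}"
        )


# Allowed: this strategy is not in the numeric-only set.
validate_impute_strategy("most_frequent", {"one_nan": "BooleanNullable"})

# Rejected: "mean" would produce a float that is not a valid boolean.
try:
    validate_impute_strategy(
        "mean", {"fully_bool": "Boolean", "one_nan": "BooleanNullable"}
    )
except ValueError as err:
    print(err)
```

Failing early with a clear message like this would replace the less obvious `TypeConversionError` raised during `transform`, at the cost of duplicating some of the type awareness the Imputer component already has.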

tamargrey avatar Mar 06 '23 14:03 tamargrey

We should also consider this for the TargetImputer, which would have the same problem.

tamargrey avatar Mar 06 '23 16:03 tamargrey