woodwork
Woodwork Incorrectly Infers Boolean
I would expect the following test to pass. Within concat_columns, we're seeing that when a DataFrame containing a column of mixed nulls and integers is passed the Integer logical type during inference, the init fails. This is expected, and an MR was put up to make concat_columns resilient to it. When we extended the test to cover Boolean/BooleanNullable, we discovered that the init imputes the missing boolean value instead of raising an error about the attempted coercion to a non-nullable type.
The test should also be extendable to Integer/IntegerNullable (and float64/Float64 when they're a thing).
import numpy as np
import pandas as pd
import pytest

# Importing from woodwork also registers the `.ww` accessor on DataFrames.
from woodwork.logical_types import Boolean, BooleanNullable


@pytest.mark.parametrize("none_type", [None, np.nan, pd.NA])
@pytest.mark.parametrize("pass_logical_types", [True, False])
def test_boolean_inference(none_type, pass_logical_types):
    df = pd.DataFrame({"boolean": [none_type, True, False, True]})
    if pass_logical_types:
        with pytest.raises(Exception):
            # Would expect init to fail: the column contains a null, and
            # Boolean is backed by the non-nullable bool dtype.
            df.ww.init(logical_types={"boolean": Boolean})
    else:
        df.ww.init()
        assert isinstance(df.ww.logical_types["boolean"], BooleanNullable)
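The imputation described above can be reproduced with pandas alone. The following is a minimal sketch, not woodwork's actual code path: it assumes the init ends up doing something equivalent to a plain astype cast to the dtype backing each logical type ("bool" for Boolean, "boolean" for BooleanNullable), which this issue does not confirm.

```python
import pandas as pd

# Object-dtype column with one missing value, as in the test above.
series = pd.Series([None, True, False, True], dtype="object")

# Casting to the non-nullable "bool" dtype does not raise here; the missing
# entry is silently replaced by a concrete True/False value.
as_bool = series.astype("bool")
print(as_bool.dtype, int(as_bool.isna().sum()))

# Casting to the nullable "boolean" extension dtype keeps the missing entry.
as_nullable = series.astype("boolean")
print(as_nullable.dtype, int(as_nullable.isna().sum()))
```

If the init does behave like the first cast, the null is lost without any error, which is the behavior the test is meant to catch.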
@chukarsten @ParthivNaresh pandas has a method called convert_dtypes, added in version 1.0.0, that could possibly provide better inference for nullable types. (docs)
import numpy as np
import pandas as pd
from woodwork.logical_types import BooleanNullable

for none_type in [None, np.nan, pd.NA]:
    # initial dtype is object
    series = pd.Series([none_type, True, True], dtype="object")
    # convert_dtypes infers the nullable boolean dtype
    inferred_dtype = series.convert_dtypes().dtype
    assert str(inferred_dtype) == BooleanNullable.primary_dtype
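Since the issue also asks for Integer/IntegerNullable coverage, the same check can be sketched for integers using pandas only. This is an illustration under the assumption that IntegerNullable is backed by the pandas "Int64" dtype; the literal string is used here rather than a woodwork attribute so the snippet runs without woodwork installed.

```python
import numpy as np
import pandas as pd

inferred = {}
for none_type in [None, np.nan, pd.NA]:
    # Start from an object-dtype column mixing a null with integers.
    series = pd.Series([none_type, 1, 2], dtype="object")
    # convert_dtypes picks a nullable dtype that can hold the missing value.
    converted = series.convert_dtypes()
    inferred[repr(none_type)] = str(converted.dtype)
    # The null survives the conversion instead of being imputed.
    assert converted.isna().sum() == 1

print(inferred)
```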
@jeff-hernandez Wow, nice catch! We should definitely explore this and see where we can use it. I'm thinking that in EvalML, if we need quick high-level type inference, we might be able to use this. In Woodwork, we can use the extension concept they provided on top of the smarter inference we're now doing for nulls.