Woodwork Incorrectly Infers Boolean

Open chukarsten opened this issue 2 years ago • 2 comments

Within concat_columns, we're seeing that init fails when a DataFrame column containing a mix of nulls and integers is given the Integer logical type. This is expected, and an MR was put up to make concat_columns resilient to it. When we extended the test to cover Boolean/BooleanNullable, however, we found that init imputes the missing boolean value instead of raising an error about the attempted coercion to a non-nullable type.

I would expect the following test to pass, and to be extendable to Integer/IntegerNullable (and float64/Float64 once they exist).

import numpy as np
import pandas as pd
import pytest
# Importing from woodwork also registers the `ww` DataFrame accessor.
from woodwork.logical_types import Boolean, BooleanNullable


@pytest.mark.parametrize("none_type", [None, np.nan, pd.NA])
@pytest.mark.parametrize("pass_logical_types", [True, False])
def test_boolean_inference(none_type, pass_logical_types):
    df = pd.DataFrame({"boolean": [none_type, True, False, True]})
    if pass_logical_types:
        with pytest.raises(Exception):
            # Expect init to fail: a column containing a null cannot be
            # coerced to the non-nullable Boolean (bool) type.
            df.ww.init(logical_types={"boolean": Boolean})
    else:
        # Without an explicit type, inference should pick the nullable type.
        df.ww.init()
        assert isinstance(df.ww.logical_types["boolean"], BooleanNullable)
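
For illustration, the kind of silent imputation described above can be reproduced in plain pandas, without Woodwork. This is only a sketch: whether Woodwork's init goes through the same casting path is an assumption, and `cast_to_bool` is a hypothetical helper, not part of either library. It casts an object-dtype column holding a null to the non-nullable `bool` dtype and records either the resulting values or the name of the error raised, then contrasts that with the nullable `boolean` extension dtype.

```python
import numpy as np
import pandas as pd


def cast_to_bool(values):
    """Hypothetical helper: cast object-dtype values to NumPy bool.

    Returns the cast values as a list, or the error class name if the cast raises.
    """
    series = pd.Series(values, dtype="object")
    try:
        return series.astype("bool").tolist()
    except (TypeError, ValueError) as err:
        return type(err).__name__


# One entry per null marker used in the test above.
results = {
    "None": cast_to_bool([None, True, False, True]),
    "np.nan": cast_to_bool([np.nan, True, False, True]),
    "pd.NA": cast_to_bool([pd.NA, True, False, True]),
}

# The nullable extension dtype keeps the missing value instead of imputing it.
nullable = pd.Series([None, True, False, True], dtype="object").astype("boolean")

print(results)
print(nullable.isna().tolist())
```

In the pandas versions I'm aware of, the `None` and `np.nan` casts succeed and fill the null slot with a concrete boolean, which is the behaviour the test expects an explicit Boolean type to reject, while the nullable cast preserves the missing entry.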

chukarsten, Aug 04 '22 01:08

@chukarsten @ParthivNaresh pandas added a method called convert_dtypes in version 1.0.0, which could provide better inference for nullable types. (docs)

from woodwork.logical_types import BooleanNullable
import pandas as pd
import numpy as np


for none_type in [None, np.nan, pd.NA]:
    # initial dtype is object
    series = pd.Series([none_type, True, True], dtype='object')

    # method infers dtype to boolean nullable
    inferred_dtype = series.convert_dtypes().dtype
    assert str(inferred_dtype) == BooleanNullable.primary_dtype 
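
Since the issue also asks for the test to extend to Integer/IntegerNullable, here is a sketch of the same check for an integer column with a null. `infer_nullable_dtype` is a hypothetical helper, and mapping the resulting pandas dtype name back to a Woodwork logical type is an assumption that this snippet does not verify.

```python
import numpy as np
import pandas as pd


def infer_nullable_dtype(values):
    """Hypothetical helper: return the dtype name convert_dtypes infers
    for values that start out as an object-dtype Series."""
    series = pd.Series(values, dtype="object")
    return str(series.convert_dtypes().dtype)


# Same three null markers as in the boolean example, now mixed with integers.
inferred = {
    repr(none_type): infer_nullable_dtype([none_type, 1, 2, 3])
    for none_type in (None, np.nan, pd.NA)
}
print(inferred)
```

One thing to keep in mind if this is used for inference: with default arguments, convert_dtypes also converts float columns whose non-null values are all whole numbers to the nullable integer dtype, so it cannot on its own distinguish "integers with nulls" from "floats that happen to be whole".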

jeff-hernandez, Aug 05 '22 16:08

@jeff-hernandez Wow, nice catch! We should definitely explore this and see where we can use it. I'm thinking that in EvalML, if we need quick, high-level type inference, we might be able to use this. In Woodwork, we can use the extension concept they provided on top of the smarter inference we're now doing for nulls.

ParthivNaresh, Aug 05 '22 16:08