Woodwork initialization fails with nullable int in dataframe

Open leahmcguire opened this issue 1 year ago • 1 comments

If you try to initialize Woodwork on a DataFrame that contains nullable integers, it correctly infers the logical type but then throws an error when it applies the pandas type conversion to the underlying data (because pandas turns anything with NaNs into floats).

The error is:

    def transform(self, series, null_invalid_values=False):
        """Converts the series dtype to match the logical type's if it is different."""
        new_dtype = self._get_valid_dtype(type(series))
        if new_dtype != str(series.dtype):
            # Update the underlying series
            try:
                series = series.astype(new_dtype)
            except (TypeError, ValueError):
>               raise TypeConversionError(series, new_dtype, type(self))
E               woodwork.exceptions.TypeConversionError: Error converting datatype for b from type float64 to type Int64. Please confirm the underlying data is consistent with logical type IntegerNullable.

/usr/local/lib/python3.9/site-packages/woodwork/logical_types.py:76: TypeConversionError

Code Sample, a copy-pastable example to reproduce your bug.

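    import numpy
    import woodwork  # registers the .ww accessor on DataFrames
    from pandas import DataFrame

    data = DataFrame.from_dict(
        {
            "a": [numpy.inf, numpy.nan, numpy.NAN, -numpy.Inf, None],
            "b": [numpy.finfo("d").max, numpy.finfo("d").min, 3, 1, None],
        }
    )
    # Raises TypeConversionError while converting column "b" to Int64.
    data.ww.init()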

leahmcguire, Oct 25 '22 17:10

pandas==1.4.4 woodwork[dask]==0.19.0

leahmcguire, Oct 25 '22 17:10

Hey @leahmcguire, thanks for reporting this bug! What's happening is that our inference function for IntegerNullable only checks whether the values are whole numbers (using mod logic); it should also check that the values fall within the accepted range of Int64 values. numpy.finfo("d").max is outside that range, which results in this error.
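
To illustrate why a mod-based check accepts this value, here is a minimal standard-library sketch. It assumes `sys.float_info.max` as a stand-in for `numpy.finfo("d").max` (both are the largest finite IEEE 754 double). The names `INT64_MAX`, `INT64_MIN`, and `biggest` are illustrative only and are not part of woodwork.

```python
import sys

# Bounds of a signed 64-bit integer, written out explicitly.
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

# Stand-in for numpy.finfo("d").max: the largest finite double.
biggest = sys.float_info.max

# The value has no fractional part, so a mod-style check accepts it ...
passes_mod_check = biggest % 1 == 0

# ... but it is far outside what Int64 can store.
fits_int64 = INT64_MIN <= biggest <= INT64_MAX

print(passes_mod_check, fits_int64)
```

The first flag is `True` and the second is `False`, which matches the mismatch described above: the value is inferred as an integer but cannot be converted to one.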

For whoever picks this up, updating our inference function to something along the lines of

    import sys

    # Int64 bounds; sys.maxsize is 2**63 - 1 on 64-bit builds of Python.
    max_int = sys.maxsize
    min_int = -sys.maxsize - 1

    # pdtypes, ww, and _is_categorical_series are assumed to already be
    # in scope in the module that holds the inference functions.
    def integer_nullable_func(series):
        if pdtypes.is_integer_dtype(series.dtype):
            threshold = ww.config.get_option("numeric_categorical_threshold")
            if threshold is not None:
                return not _is_categorical_series(series, threshold)
            else:
                return True
        elif pdtypes.is_float_dtype(series.dtype):

            def _is_valid_int(value):
                # Whole number that also fits in the Int64 range.
                return min_int <= value <= max_int and value.is_integer()

            # A float column without nulls stays a float column.
            if not series.isnull().any():
                return False
            series_no_null = series.dropna()
            return all(_is_valid_int(v) for v in series_no_null)

        return False

should move us in the right direction of a solution.
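
The range check suggested above can be tried on its own, without pandas or woodwork. This is a hedged sketch rather than woodwork's implementation: `is_valid_int64` is a hypothetical helper, and the bounds are written out explicitly instead of using `sys.maxsize`, an assumption made so the result does not depend on the platform.

```python
import math

# Signed 64-bit integer bounds, independent of the platform's sys.maxsize.
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def is_valid_int64(value: float) -> bool:
    """Return True if a float is integer-valued and within the Int64 range."""
    # Reject NaN and +/-inf first; neither can be stored as an integer.
    if not math.isfinite(value):
        return False
    # Python compares int and float exactly, so 2.0**63 is rejected even
    # though float(INT64_MAX) would round up to 2.0**63.
    return value.is_integer() and INT64_MIN <= value <= INT64_MAX


samples = [3.0, 1.0, 2.5, -(2.0**63), 2.0**63, 1.7976931348623157e308, float("inf")]
results = [is_valid_int64(v) for v in samples]
print(results)
```

The boundary cases are the interesting part: `-(2.0**63)` is the smallest valid Int64 and is accepted, while `2.0**63` is one past the largest and is rejected. Keeping the bounds as Python integers, rather than converting them to floats, is what makes that upper comparison exact.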

bchen1116, Nov 08 '22 18:11