Woodwork initialization fails with nullable int in dataframe
If you try to initialize Woodwork on a dataframe with nullable integers, it correctly infers the type but throws an error when it applies the pandas type conversion to the underlying data (because pandas turns any integer column containing NaNs into floats).
The error is:

```
def transform(self, series, null_invalid_values=False):
    """Converts the series dtype to match the logical type's if it is different."""
    new_dtype = self._get_valid_dtype(type(series))
    if new_dtype != str(series.dtype):
        # Update the underlying series
        try:
            series = series.astype(new_dtype)
        except (TypeError, ValueError):
>           raise TypeConversionError(series, new_dtype, type(self))
E           woodwork.exceptions.TypeConversionError: Error converting datatype for b from type float64 to type Int64. Please confirm the underlying data is consistent with logical type IntegerNullable.

/usr/local/lib/python3.9/site-packages/woodwork/logical_types.py:76: TypeConversionError
```
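The underlying pandas behavior can be demonstrated without woodwork at all: a float column with NaNs casts cleanly to nullable `Int64` as long as every value fits in the 64-bit signed range, and fails otherwise. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Integer-valued floats with NaN cast cleanly to pandas' nullable Int64.
ok = pd.Series([1.0, 2.0, np.nan]).astype("Int64")
print(ok.dtype)  # Int64

# float64's maximum is far outside the Int64 range, so the same cast fails --
# this is the conversion woodwork attempts in transform() above.
try:
    pd.Series([np.finfo("d").max, np.nan]).astype("Int64")
    cast_failed = False
except (TypeError, ValueError):
    cast_failed = True
print(cast_failed)
```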
Code Sample, a copy-pastable example to reproduce the bug:

```python
import numpy
from pandas import DataFrame
import woodwork  # registers the .ww accessor

data = DataFrame.from_dict(
    {
        "a": [numpy.inf, numpy.nan, numpy.NAN, -numpy.Inf, None],
        "b": [numpy.finfo("d").max, numpy.finfo("d").min, 3, 1, None],
    }
)
data.ww.init()  # raises TypeConversionError
```
Versions: `pandas==1.4.4`, `woodwork[dask]==0.19.0`
Hey @leahmcguire, thanks for reporting this bug! What's happening is that our inference function for `IntegerNullable` only checks whether the values are integer values (using mod logic), but we should also check that the values fall within the accepted range of `Int64` values. `numpy.finfo("d").max` provides a value outside of that range, resulting in this error.
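For illustration, a mod-style integrality check along those lines (a rough sketch, not woodwork's actual implementation) happily accepts float64's maximum value:

```python
import numpy as np
import pandas as pd

def looks_integer(series):
    # Mod-based integrality check: every non-null value satisfies value % 1 == 0.
    # float64's max has no fractional part, so it slips through even though
    # it cannot be represented as an Int64.
    return bool((series.dropna() % 1 == 0).all())

passes = looks_integer(pd.Series([np.finfo("d").max, 3.0, None]))
print(passes)  # True -- the overflow value is treated as a valid integer
```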
For whoever picks this up, updating our inference function to something along the lines of

```python
import sys

# Int64 bounds; on CPython, sys.maxsize == 2**63 - 1
max_int = sys.maxsize
min_int = -sys.maxsize - 1


def integer_nullable_func(series):
    # pdtypes, ww, and _is_categorical_series come from woodwork's
    # existing inference module.
    if pdtypes.is_integer_dtype(series.dtype):
        threshold = ww.config.get_option("numeric_categorical_threshold")
        if threshold is not None:
            return not _is_categorical_series(series, threshold)
        else:
            return True
    elif pdtypes.is_float_dtype(series.dtype):

        def _is_valid_int(value):
            return min_int <= value <= max_int and value.is_integer()

        if not series.isnull().any():
            return False
        series_no_null = series.dropna()
        return all(_is_valid_int(v) for v in series_no_null)

    return False
```
should move us in the right direction of a solution.
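As a quick sanity check of the proposed range guard, a standalone sketch (the `MAX_INT64`/`MIN_INT64` names are illustrative, not from woodwork) rejects column `b` from the reproduction:

```python
import sys
import numpy as np
import pandas as pd

MAX_INT64 = sys.maxsize        # 2**63 - 1
MIN_INT64 = -sys.maxsize - 1   # -2**63

def _is_valid_int(value):
    # Integer-valued AND representable as a 64-bit signed integer.
    return MIN_INT64 <= value <= MAX_INT64 and float(value).is_integer()

# Column "b" from the reproduction above.
b = pd.Series([np.finfo("d").max, np.finfo("d").min, 3.0, 1.0, None])
result = all(_is_valid_int(v) for v in b.dropna())
print(result)  # False -- finfo max/min overflow Int64, so IntegerNullable is rejected
```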