
Add automatic fallback to nullable logical types

Open thehomebrewnerd opened this issue 3 years ago • 2 comments

As a user, I wish Woodwork would automatically fall back to nullable logical types when I initialize with a non-nullable logical type on data that contains null values, raising a warning to notify me that this has happened. This would be useful when a null value has been added to a column that did not originally have missing values and I need to reinitialize Woodwork. It would also allow reinitialization with an existing schema, which fails today because the new dtypes in the modified data are incompatible with the dtypes of the logical types in the schema.

Code Example 1

import pandas as pd
import woodwork as ww

df = pd.DataFrame({'id': [0, 1, 2], 'vals': [1, 2, pd.NA]})
df.ww.init(logical_types={'vals': 'Integer'})
WoodworkInitWarning: Data for column `vals` is incompatible with logical type `Integer`. Using `IntegerNullable` instead.
df.ww
       Physical Type     Logical Type Semantic Tag(s)
Column                                               
id             int64          Integer     ['numeric']
vals           Int64  IntegerNullable     ['numeric']

Code Example 2

import pandas as pd
import woodwork as ww

df = pd.DataFrame({'id': [0, 1, 2], 'vals': [1, 2, 3]})
df.ww.init()

new_df = df.ww.replace({3: pd.NA})
new_df.ww.init(schema=df.ww.schema)
WoodworkInitWarning: Data for column `vals` is incompatible with logical type `Integer`. Using `IntegerNullable` instead.
new_df.ww
       Physical Type     Logical Type Semantic Tag(s)
Column                                               
id             int64          Integer     ['numeric']
vals           Int64  IntegerNullable     ['numeric']
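
The requested behavior in both examples can be sketched without Woodwork itself. This is only an illustration under stated assumptions: `NULLABLE_FALLBACKS`, `InitWarning`, and `resolve_logical_type` are hypothetical names, not part of Woodwork's API, and logical types are represented by plain strings that mirror Woodwork's type names.

```python
import warnings

# Hypothetical mapping from non-nullable logical type names to their
# nullable counterparts. The names mirror Woodwork's logical types, but
# this dict is an assumption made for illustration.
NULLABLE_FALLBACKS = {
    "Integer": "IntegerNullable",
    "Boolean": "BooleanNullable",
    "Age": "AgeNullable",
}


class InitWarning(UserWarning):
    """Stand-in for the WoodworkInitWarning proposed in this issue."""


def resolve_logical_type(column, requested, values):
    """Return the logical type name to use, falling back only for nulls."""
    has_nulls = any(v is None for v in values)
    fallback = NULLABLE_FALLBACKS.get(requested)
    if has_nulls and fallback is not None:
        # Notify the user instead of failing initialization.
        warnings.warn(
            f"Data for column `{column}` is incompatible with logical type "
            f"`{requested}`. Using `{fallback}` instead.",
            InitWarning,
        )
        return fallback
    # No nulls, or no known nullable counterpart: keep the requested type.
    return requested
```

Calling `resolve_logical_type("vals", "Integer", [1, 2, None])` would warn and return `"IntegerNullable"`, while a column without nulls keeps `"Integer"`.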

thehomebrewnerd, Aug 27 '21 13:08

Another way to achieve this kind of fallback behavior would be to let Woodwork users specify how logical types should fall back. For example, users who can't use the nullable dtypes could choose Integer: Double over Integer: IntegerNullable via a fallback_logical_types parameter. Otherwise, those users wouldn't have a way of avoiding the nullable dtypes.

Though I'm not sure whether that would be a config option or a parameter of Woodwork init. I prefer Woodwork init because it also implicitly gives users a way to avoid this behavior entirely.

There are many reasons that dtype conversion could fail, and if we only want the fallback to happen when the failure is caused by nullability, then we'd need to be really clear about that. So we could name it something like fallback_nullable_logical_types and require that all keys be non-nullable logical types and all values be nullable logical types.
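
A minimal sketch of the key/value check described above, assuming logical types are given by name. Everything here is hypothetical: `fallback_nullable_logical_types` is only a proposed parameter, and the two sets and the validator are not part of Woodwork. Treating `Double` as able to hold nulls is also an assumption, made so that the Integer: Double choice passes.

```python
# Assumed groupings of logical type names, for illustration only.
NON_NULLABLE_TYPES = {"Integer", "Boolean", "Age"}
TYPES_THAT_HOLD_NULLS = {"IntegerNullable", "BooleanNullable", "AgeNullable", "Double"}


def validate_fallback_nullable_logical_types(fallbacks):
    """Check a user-supplied fallback mapping and return a copy of it."""
    for source, target in fallbacks.items():
        if source not in NON_NULLABLE_TYPES:
            raise ValueError(
                f"Fallback key `{source}` is not a non-nullable logical type."
            )
        if target not in TYPES_THAT_HOLD_NULLS:
            raise ValueError(
                f"Fallback value `{target}` cannot hold null values."
            )
    return dict(fallbacks)
```

With this check, `{"Integer": "Double"}` would be accepted, while `{"Double": "Integer"}` would raise a `ValueError` because the key is not a non-nullable type.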

tamargrey, Sep 14 '21 17:09

Will wait on https://github.com/alteryx/featuretools/issues/1686

gsheni, Nov 18 '21 22:11