woodwork
Add automatic fallback to nullable logical types
As a user, I wish Woodwork would automatically fall back to a nullable logical type when I attempt to initialize with a non-nullable logical type on data that contains null values, raising a warning to notify me that this has happened. This would be useful when a null value has been added to one or more columns that did not originally have missing values and I need to reinitialize Woodwork. Adding this behavior would allow reinitialization with an existing schema, which fails today because the new dtypes in the modified data are incompatible with the logical type dtypes in the schema.
Code Example 1
import pandas as pd
import woodwork as ww
df = pd.DataFrame({'id': [0, 1, 2], 'vals': [1, 2, pd.NA]})
df.ww.init(logical_types={'vals': 'Integer'})
WoodworkInitWarning: Data for column `vals` is incompatible with logical type `Integer`. Using `IntegerNullable` instead.
df.ww
       Physical Type     Logical Type Semantic Tag(s)
Column
id             int64          Integer     ['numeric']
vals           Int64  IntegerNullable     ['numeric']
Code Example 2
import pandas as pd
import woodwork as ww
df = pd.DataFrame({'id': [0, 1, 2], 'vals': [1, 2, 3]})
df.ww.init()
new_df = df.ww.replace({3: pd.NA})
new_df.ww.init(schema=df.ww.schema)
WoodworkInitWarning: Data for column `vals` is incompatible with logical type `Integer`. Using `IntegerNullable` instead.
new_df.ww
       Physical Type     Logical Type Semantic Tag(s)
Column
id             int64          Integer     ['numeric']
vals           Int64  IntegerNullable     ['numeric']
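To make the requested behavior concrete, here is a rough, library-independent sketch of the resolution step both examples rely on. It is not Woodwork's implementation: `resolve_logical_type`, the `NULLABLE_FALLBACKS` table, and the stand-in `WoodworkInitWarning` class are hypothetical names, logical types are represented as plain strings, and `None` stands in for `pd.NA`.

```python
import warnings

# Hypothetical fallback table; only the Integer pair comes from the examples.
NULLABLE_FALLBACKS = {"Integer": "IntegerNullable", "Boolean": "BooleanNullable"}


class WoodworkInitWarning(UserWarning):
    """Stand-in for the warning class named in the examples."""


def resolve_logical_type(column, requested, values):
    """Return the logical type to use, falling back only when nulls are present."""
    # Assumption: None marks a missing value here; real code would use pandas null checks.
    has_nulls = any(v is None for v in values)
    fallback = NULLABLE_FALLBACKS.get(requested)
    if has_nulls and fallback is not None:
        warnings.warn(
            f"Data for column `{column}` is incompatible with logical type "
            f"`{requested}`. Using `{fallback}` instead.",
            WoodworkInitWarning,
        )
        return fallback
    return requested


# Record warnings so the fallback can be inspected instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = {
        "id": resolve_logical_type("id", "Integer", [0, 1, 2]),
        "vals": resolve_logical_type("vals", "Integer", [1, 2, None]),
    }

print(resolved)
```

Only the column that actually contains a null is switched, and only when a fallback is registered for the requested type, so other conversion failures would still surface as errors.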
Another way to achieve this kind of fallback behavior would be to let Woodwork users specify how logical types should fall back. For example, users who can't use the nullable dtypes could choose Integer: Double over Integer: IntegerNullable via a fallback_logical_types parameter. Otherwise, users wouldn't have a way of avoiding those dtypes.
Though I'm not sure if that would be a config or something inside of Woodwork init. I like Woodwork init because it also implicitly adds the ability to avoid this behavior entirely.
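As a rough illustration of the init-parameter option, the sketch below shows how a user-supplied mapping might override a default and how passing `None` might switch the fallback off. Everything here is an assumption for discussion: `pick_fallback`, `DEFAULT_FALLBACKS`, the merge-with-defaults semantics, and the `None` convention are hypothetical, and logical types are plain strings.

```python
# Hypothetical default; the Integer pair is the one discussed in this issue.
DEFAULT_FALLBACKS = {"Integer": "IntegerNullable"}


def pick_fallback(requested, fallback_logical_types=DEFAULT_FALLBACKS):
    """Return the logical type to fall back to, or None if there is no fallback."""
    # Assumption: passing None disables fallback behavior entirely.
    if fallback_logical_types is None:
        return None
    # Assumption: user entries override the defaults rather than replace the table.
    merged = {**DEFAULT_FALLBACKS, **fallback_logical_types}
    return merged.get(requested)


print(pick_fallback("Integer"))                         # default behavior
print(pick_fallback("Integer", {"Integer": "Double"}))  # user avoids nullable dtypes
print(pick_fallback("Integer", None))                   # fallback disabled
```

A global config option could hold the same mapping, but a per-call argument keeps the opt-out visible at the point where initialization happens.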
There are many reasons that dtype conversion could fail, and if we only want the fallback to happen when the failure is due to nullability, then we'd need to be really clear about that. That could mean naming the parameter something like fallback_nullable_logical_types and requiring that all keys be non-nullable logical types and all values be nullable logical types.
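The key/value restriction could be enforced with a small validation step along these lines. This is only a sketch under stated assumptions: `validate_fallback_nullable_logical_types` is a hypothetical helper, logical types are plain strings, and the two sets list just a few example types rather than everything Woodwork defines.

```python
# Illustrative subsets only, not a complete list of logical types.
NON_NULLABLE_TYPES = {"Integer", "Boolean"}
NULLABLE_TYPES = {"IntegerNullable", "BooleanNullable"}


def validate_fallback_nullable_logical_types(mapping):
    """Check that every key is non-nullable and every value is nullable."""
    for source, target in mapping.items():
        if source not in NON_NULLABLE_TYPES:
            raise ValueError(f"Key `{source}` is not a non-nullable logical type.")
        if target not in NULLABLE_TYPES:
            raise ValueError(f"Value `{target}` is not a nullable logical type.")
    return mapping


# Accepted: a non-nullable key mapped to a nullable value.
validate_fallback_nullable_logical_types({"Integer": "IntegerNullable"})

# Rejected: Double is not in the nullable set used by this sketch.
try:
    validate_fallback_nullable_logical_types({"Integer": "Double"})
except ValueError as err:
    print(err)
```

Note that this stricter rule would reject the Integer: Double mapping suggested earlier, so the two proposals would need to be reconciled before settling on a parameter name.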
Will wait on https://github.com/alteryx/featuretools/issues/1686