TargetDistributionDataCheck: Add support for nullable logical types - scipy nullable type incompatibilities
Currently, TargetDistributionDataCheck does not allow nullable logical types. This doesn't match the behavior of InvalidTargetDataCheck, which does allow nullable types. With the new nullable type support across AutoML search, we should update TargetDistributionDataCheck to allow the numeric nullable types, AgeNullable and IntegerNullable.
We are currently blocked from doing so because the scipy.stats utils jarque_bera and shapiro are incompatible with nullable types that contain null values:
import pandas as pd
import pytest
from scipy.stats import jarque_bera, shapiro

for dtype in ["Int64", "boolean"]:
    for scipy_method in [jarque_bera, shapiro]:
        # Works if null value isn't present
        y = pd.Series([1, 0] * 50, dtype=dtype)
        scipy_method(y)

        # Breaks if null value is present
        y.iloc[-1] = pd.NA
        with pytest.raises(TypeError, match="value of NA is ambiguous"):
            scipy_method(y)
The data check's rejection of nullable types is not reachable from the AutoMLSearch class directly, but it is reachable if you call the search or search_iterative utilities, since they use DefaultDataChecks, which contains TargetDistributionDataCheck:
import pandas as pd
import woodwork as ww
from evalml.automl import search

# X_y_regression is assumed to be a pytest fixture providing regression data
X, y = X_y_regression
y = ww.init_series(pd.Series(range(len(y))), logical_type="IntegerNullable")
_, data_check_results = search(
    X_train=X,
    y_train=y,
    problem_type="regression",
    max_time=42,
    patience=3,
    tolerance=0.5,
    mode="fast",
)
assert data_check_results[0]["message"] == (
    "Target is unsupported integer_nullable type. "
    "Valid Woodwork logical types include: integer, double, age, age_fractional"
)
We should handle this incompatibility and then allow the nullable numeric types in this data check.
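One possible way to handle the incompatibility is sketched below. It assumes that dropping null target values before the normality tests is acceptable, which is a design decision this issue does not settle. The helper name to_scipy_input is hypothetical and is not part of evalml or scipy. The idea is to convert the nullable series to a plain float64 numpy array so that scipy never sees pd.NA.

```python
import numpy as np
import pandas as pd
from scipy.stats import jarque_bera, shapiro


def to_scipy_input(y: pd.Series) -> np.ndarray:
    """Hypothetical helper: drop nulls and return a float64 numpy array.

    Assumption: ignoring null target values is acceptable for the
    distribution check; nulls would be reported by another data check.
    """
    return y.dropna().astype("float64").to_numpy()


# Nullable integer target that contains a null value
y = pd.Series([1, 0] * 50, dtype="Int64")
y.iloc[-1] = pd.NA

values = to_scipy_input(y)
for scipy_method in [jarque_bera, shapiro]:
    # Both calls now receive a plain numpy array with no pd.NA
    statistic, p_value = scipy_method(values)
    print(scipy_method.__name__, statistic, p_value)
```

An alternative would be to skip the distribution check entirely when nulls are present and rely on InvalidTargetDataCheck to report them; which behavior is preferable is left open here.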