evalml icon indicating copy to clipboard operation
evalml copied to clipboard

ARIMARegressor should fill boolean nans with the mode then convert to double to fully support boolean values

Open tamargrey opened this issue 2 years ago • 3 comments

  • As a user, I wish I could pass any boolean column into the ARIMARegressor component. Currently, if you pass in a column with the Boolean logical type, we convert it to Double prior to passing into the estimator. But we do nothing with BooleanNullable columns, which will cause them to raise an error that AutoARIMA can't handle boolean values.

We should add support for all boolean values. Note - this is NOT a nullable type incompatibility. Sktime's AutoARIMA regressor is not meant to take in any boolean value -- nullable type or not--and we do not expect them to add support for that. It is our choice in evalml to convert booleans to doubles to pass into ARIMA, so we should support all boolean types.

We'll need to handle potential null values in the same way we handle null values in numeric columns but use the mode instead of the mean. This means we should fillna with the mode of the boolean nullable columns, then convert to Double.

tamargrey avatar Feb 16 '23 18:02 tamargrey

I would argue this is a nullable type incompatibility. Any null values that enter pipelines with an ARIMARegressor should be imputed before reaching the estimator, and we don't expect our estimators to be able to handle truly null values intrinsically. Therefore, if a BooleanNullable column reached the ARIMARegressor we would expect it to be nullable but have no actual nan values, meaning we should handle it the same way we handle Boolean.

eccabay avatar Feb 27 '23 13:02 eccabay

@eccabay I want to make sure we're separating out the actual handling of null values from the concept of nullable type incompatibilities. Many of our components can't handle nans but can take in Int65 or boolean pandas dtypes, and we don't consider them to have any nullable type incompatibilities. I wouldn't hate the chance to get to standardize how individual components handle nans, but that would've widened the scope of the nullable type work beyond what we need.

Therefore, if a BooleanNullable column reached the ARIMARegressor we would expect it to be nullable but have no actual nan values, meaning we should handle it the same way we handle Boolean.

I 100% agree with this and, to me, this is why it's not a nullable type incompatibility, because, if we ignore the fact that nans might be present (which we generally do for estimators; this one just happens to have some nan imputing in place for numeric types), we can treat them exactly the same by converting to Double. AutoARIMA has a compatibility issue with any boolean type, and we're not expecting that to change.

Compare that to IntegerNullable, where we cannot handle Integer and IntegerNullable columns the same at the moment because of the incompatibility with the IntegerNullable type. Once sktime adds support, though, we'll treat Integer and IntegerNullable exactly the same. The fact that we impute nans in ARIMARegressor is entirely separate from whether the nullable integer type was incompatible.

tamargrey avatar Feb 27 '23 14:02 tamargrey

When this gets implemented, we should add BooleanNullable to the parametrization in test_arima_regressor_with_nullable_types

tamargrey avatar Mar 13 '23 21:03 tamargrey