AutoMLSearch fails with Ordinal logical type input from Featuretools
AutoMLSearch fails if the input contains Ordinal data from Featuretools, such as that generated by the Year, Month, etc. primitives.
Code Sample, a copy-pastable example to reproduce your bug.
import featuretools as ft
from evalml import AutoMLSearch
import pandas as pd
df = pd.read_csv("delhi_200.csv")
es = ft.EntitySet()
es.add_dataframe(dataframe_name="df", dataframe=df, index="id", make_index=True, time_index="date")
es["df"].ww
trans_primitives = ["day"]
features = ft.dfs(
    entityset=es,
    target_dataframe_name="df",
    max_depth=1,
    features_only=True,
    trans_primitives=trans_primitives,
)
features.append(ft.Feature(es["df"].ww["date"]))
fm = ft.calculate_feature_matrix(entityset=es, features=features)
y = fm.ww.pop("meantemp")
X = fm
problem_configuration={"gap": 0, "max_delay": 7, "forecast_horizon": 7, "time_index": "date"}
automl = AutoMLSearch(
    X,
    y,
    problem_type="time series regression",
    problem_configuration=problem_configuration,
)
automl.search()
Random Forest Regressor w/ Replace Nullable Types Transformer + Imputer + Time Series Featurizer + DateTime Featurizer + One Hot Encoder + Drop NaN Rows Transformer fold 0: Encountered an error.
Random Forest Regressor w/ Replace Nullable Types Transformer + Imputer + Time Series Featurizer + DateTime Featurizer + One Hot Encoder + Drop NaN Rows Transformer fold 0: All scores will be replaced with nan.
Fold 0: Exception during automl search: Input contains NaN
...
AutoMLSearchException: All pipelines in the current AutoML batch produced a score of np.nan on the primary objective <evalml.objectives.standard_metrics.MedianAE object at 0x2898447c0>.
Converting the logical type of the DAY(date) column in the feature matrix from Ordinal to Categorical seems to resolve this issue, FWIW.
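For reference, a minimal pandas-level sketch of why that conversion helps, assuming Woodwork's Ordinal logical type is backed by an ordered pandas categorical dtype while Categorical is unordered (the column name and values here are illustrative):

```python
import pandas as pd

# An Ordinal column like DAY(date) is backed by an *ordered* categorical dtype.
day = pd.Series([1, 2, 1], dtype=pd.CategoricalDtype([1, 2], ordered=True))

# The workaround amounts to dropping the ordering, leaving a plain
# (unordered) categorical that downstream encoders treat as Categorical.
day_unordered = day.cat.as_unordered()

print(day.dtype.ordered)            # True
print(day_unordered.dtype.ordered)  # False
```

With Woodwork, the equivalent one-liner would be something like `fm.ww.set_types(logical_types={"DAY(date)": "categorical"})`.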
@thehomebrewnerd This appears to have been fixed!
Running your code to reproduce on main works just fine, and it produces the errors described here in evalml 0.57.0 (which I checked out because it was from around when this issue was posted). I'll do a git bisect tomorrow to see if I can find the exact commit where this was fixed, but I expect we should be able to close this issue.
Okay, decided to do the git bisect today in the end: it seems that fc982d77f015a8040effc1a0d58fad4eaa0ade6a is the first commit that fixes this bug. My guess is that something about explicitly setting types instead of leaving things up to woodwork inference basically achieves the same thing as setting the type to be categorical. @eccabay might have some insight into what the specific change in that commit would have been.
Assuming it is some sort of explicit type setting that removes the ordinal type, though, this doesn't really solve a potential issue that may exist with Ordinals being used in EvalML. And once we integrate the Ordinal Encoder into EvalML (in #3765), this issue may become relevant again, so I'll dig a little deeper here before closing.
Found the culprit: https://github.com/alteryx/evalml/commit/fc982d77f015a8040effc1a0d58fad4eaa0ade6a#diff-1839ccf15077fbf3e37f6e638f66745f6f354cec3df273e2cc9b7b3e40f3863dL261
We were previously only transforming lagged Categorical logical type columns to be doubles (because TimeSeriesFeaturizer._get_categorical_columns does X.ww.select(["categorical", "boolean"]), which it still does, so it's probably worth double checking that we want that method to exclude any columns with the category semantic tag, like Ordinal or PostalCode).
So Becca's change made it so that we transform any delayed feature to be double so that it's now ignored by the onehot encoder like the rest of categorical columns. I can't tell if that's a desired behavior (I suspect it might be, because we'd need to handle nans with the one-hot encoder to allow lagged features). The closest issue I could find to this is https://github.com/alteryx/evalml/issues/2967. So this might be a known issue, but I thought I'd bring it up.
Regarding whether or not this is relevant for #3765: It isn't, but that's only because Ordinal columns won't make it to the ordinal encoder in time series because they'll have been lagged and turned into Doubles. If we want to allow lagged categorical features to be used by the onehot and ordinal encoders, that should be its own issue.
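To illustrate the selection concern above in plain pandas terms (a hedged sketch; the actual selection goes through X.ww.select and Woodwork's tag system), selecting only unordered categoricals misses Ordinal-like columns, while selecting anything categorical-dtyped catches both:

```python
import pandas as pd

X = pd.DataFrame({
    "color": pd.Series(["r", "g"], dtype="category"),  # maps to Categorical
    "size": pd.Series(
        ["s", "l"], dtype=pd.CategoricalDtype(["s", "l"], ordered=True)
    ),  # ordered categorical, what Woodwork would infer as Ordinal
    "temp": [20.5, 31.0],
})

# Analogous to select(["categorical"]): unordered categoricals only.
unordered_only = [c for c in X.columns
                  if isinstance(X[c].dtype, pd.CategoricalDtype)
                  and not X[c].dtype.ordered]

# Analogous to selecting by the "category" semantic tag: any categorical dtype.
any_categorical = [c for c in X.columns
                   if isinstance(X[c].dtype, pd.CategoricalDtype)]

print(unordered_only)   # ['color']
print(any_categorical)  # ['color', 'size']
```

The gap between those two lists is exactly the set of columns (Ordinal, PostalCode, etc.) that _get_categorical_columns currently skips.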
@tamargrey Still trying to digest this information just a bit, but how can we do a categorical to double conversion reliably since categories don't have to be numeric?
@thehomebrewnerd I believe it happens via _encode_X_while_preserving_index, which will turn all the categories into numbers (here).
But the fact that _get_categorical_columns ignores Ordinal and other logical types with the category standard tag means that those columns wouldn't get ordinally encoded, and if the data wasn't already numeric in nature, we will have problems with any non-numeric ordinal or categorical feature. It should be a quick fix, but I would want to talk to other folks on the modeling team before making this change.