
Feature Mismatch caused by StackingEstimator: X has ___ features, but ________ is expecting ___ features as input.

Open pjvpjv opened this issue 2 years ago • 2 comments

The way that StackingEstimator is currently coded, it will add either 1 column or (1 + n_classes) columns based on whether predict_proba is available for the estimator being stacked. This is fantastic! It's the way it should be. When predict_proba isn't available, it just adds one column with the class prediction; when predict_proba is available, it adds the single class-prediction column plus one additional column per class with its probability.

The problem happens with these lines of code in stacking_estimator.py:

            # check all values that should be not infinity or not NAN
            if np.all(np.isfinite(y_pred_proba)):
                X_transformed = np.hstack((y_pred_proba, X))
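For context, here is a simplified, paraphrased sketch of what the transform step effectively does (transform_sketch is a hypothetical helper name, not the exact TPOT source); the key point is that the probability columns are only appended when every value is finite:

import numpy as np

def transform_sketch(estimator, X):
    # Paraphrased sketch of StackingEstimator.transform, not the exact TPOT source.
    X_transformed = np.copy(X)
    if hasattr(estimator, "predict_proba"):
        y_pred_proba = estimator.predict_proba(X)
        # probability columns are only appended when every value is finite
        if np.all(np.isfinite(y_pred_proba)):
            X_transformed = np.hstack((y_pred_proba, X))
    # the single class-prediction column is always appended
    return np.hstack((np.reshape(estimator.predict(X), (-1, 1)), X_transformed))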

I'm finding that when I call fit() and there are no infinity/NaN values in y_pred_proba, it adds (1 + n_classes) columns to the original features, and from then on the pipeline is permanently expecting that number of features.

But if you later run transform() on a different set of X values and at least one row of y_pred_proba comes back as (nan, nan), you run into problems. This happens sometimes when stacking LogisticRegression, and it's known behavior of the sklearn LR implementation when the inputs aren't scaled ahead of time.

Here's an example. I'm doing binary classification with 19 features. The exported pipeline that was found is this:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from tpot.builtins import StackingEstimator
from xgboost import XGBClassifier

exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LogisticRegression(C=20.0, dual=False, penalty="l2")),
    MaxAbsScaler(),
    XGBClassifier(learning_rate=0.1, max_depth=7, min_child_weight=7, n_estimators=100, n_jobs=1, subsample=0.9500000000000001, verbosity=0)
)

When I later run predict_proba on data that wasn't in the training set, I get the following (sometimes): X has 20 features, but MaxAbsScaler is expecting 22 features as input.

What's going on here is that when we did fit(X_train, y_train), no rows blew up the LogisticRegression estimator, so StackingEstimator passed 22 features on to MaxAbsScaler: 19 original features plus 1 class-prediction feature plus 2 class-probability features (again, I'm doing binary classification). But when I later did exported_pipeline.predict_proba(X), a row did blow up the LogisticRegression estimator, so StackingEstimator passed only 20 features on to MaxAbsScaler: 19 original features plus 1 class-prediction feature plus 0 class probabilities, since np.all(np.isfinite(y_pred_proba)) was False.

I'm not sure exactly how you'll decide you want to handle this. One idea would be to overwrite NaNs in the predict_proba class columns with 1s and 0s based on the overall prediction: in binary classification, if predict = 0 you would set predict_proba to (1, 0), and if predict = 1 you would set it to (0, 1). Or apply some sort of imputer to the NaNs?

In the end, you must do something to fill in the blanks, because you can't pass 22 columns when stacking_estimator.transform() runs on the training data during fit but only 20 columns later when it runs on new data. It must continue to pass the number of columns it originally passed when it was fit.
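A minimal sketch of the first idea above, written as a hypothetical subclass (SafeStackingEstimator is my name for it, not part of TPOT): when a row of predict_proba comes back non-finite at transform time, replace it with a one-hot encoding of the hard prediction so the column count never changes.

import numpy as np
from tpot.builtins import StackingEstimator

class SafeStackingEstimator(StackingEstimator):
    """Hypothetical sketch: keep the column count constant even when
    predict_proba returns NaN/inf for some rows."""

    def transform(self, X):
        X = np.asarray(X)
        y_pred = self.estimator.predict(X)
        X_transformed = np.copy(X)
        if hasattr(self.estimator, "predict_proba"):
            y_pred_proba = self.estimator.predict_proba(X)
            bad_rows = ~np.all(np.isfinite(y_pred_proba), axis=1)
            if np.any(bad_rows):
                # fill non-finite probability rows with a one-hot of the prediction,
                # e.g. predict=0 -> (1, 0) and predict=1 -> (0, 1) in binary classification
                classes = self.estimator.classes_
                y_pred_proba[bad_rows] = (
                    y_pred[bad_rows, None] == classes[None, :]
                ).astype(float)
            X_transformed = np.hstack((y_pred_proba, X))
        # the class-prediction column is always appended, as before
        return np.hstack((np.reshape(y_pred, (-1, 1)), X_transformed))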

If this problem applies to you and the maintainers haven't settled on a fix yet, I suggest turning off LogisticRegression in your TPOT config, as that's the main estimator I've seen producing the occasional NaN in the predict_proba class probabilities.
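For example, something like this (a sketch that assumes the default classifier config dict is importable as tpot.config.classifier_config_dict and that X_train/y_train are your training data; check your TPOT version):

from tpot import TPOTClassifier
from tpot.config import classifier_config_dict

# copy the default classifier config and drop LogisticRegression from the search space
custom_config = dict(classifier_config_dict)
custom_config.pop("sklearn.linear_model.LogisticRegression", None)

tpot = TPOTClassifier(generations=5, population_size=50,
                      config_dict=custom_config, verbosity=2)
tpot.fit(X_train, y_train)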

pjvpjv commented Aug 19, 2022

"there was a row that blew up the LogisticRegression" Can you clarify the what is in the row that causes LogisticRegression to sometimes yield different number of outputs? I'm trying to reproduce the issue

perib commented Sep 29, 2022

This happens when your training data has nothing that causes a mid-pipeline infinity or NaN, but the data you are trying to make predictions on (test data or real-world data) does. One common way this can happen is when a much larger or much smaller number shows up in the data you're predicting on. I'm sorry that I can't remember exactly which dataset I was on at the time, but an internet search turns up multiple reports of a trained logistic regression model occasionally returning NaN predictions when asked to predict on something it didn't see during training, typically values that are much bigger or smaller. Some people recommend normalization / preprocessing to avoid this. See https://stackoverflow.com/questions/62818741/how-to-fix-nan-values-coming-from-the-implementation-of-logistic-regression. Also see https://github.com/scikit-learn/scikit-learn/issues/17925 for an interesting discussion of this popping up in many different types of models, not just logistic regression, which is where it happened for me.
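One quick way to confirm this on a given dataset (a hedged diagnostic sketch; exported_pipeline is the fitted pipeline from above and X_new stands for the data that triggers the error) is to call the stacked LogisticRegression's predict_proba directly and look for non-finite rows:

import numpy as np

# step 0 of the exported make_pipeline is the StackingEstimator;
# .estimator is the LogisticRegression it fit inside the pipeline
stacked_lr = exported_pipeline.steps[0][1].estimator
proba = stacked_lr.predict_proba(X_new)
bad_rows = ~np.all(np.isfinite(proba), axis=1)
print("rows with non-finite predict_proba:", np.flatnonzero(bad_rows))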

Ultimately, since tpot is AutoML where it discovers the pipeline and you can't specify that a scaling preprocessor always comes first, you could fix this in one of several ways:

  • Institute a rule that you always do normalization prior to LogisticRegression. This would fix it a fair bit of the time but not always, and it doesn't fix all the other estimators out there that sometimes produce NaNs when extrapolating outside of the training dataset, so it would not be a global fix for the underlying issue: no NaNs during training, but NaNs popping up later during prediction.
  • Fix StackingEstimator so it always works around problems of this sort automatically. See original post for my initial ideas on this.
  • Add a descriptive warning in the StackingEstimator code that explains WHY the pipeline blew up while trying to make a prediction. Currently it fails in a way that is nearly impossible to understand, and I had to spend half a day figuring out where the issue was. This could be as simple as explaining that a NaN arose in the specific classifier being stacked (see the sketch after this list).
  • Somehow find a way to propagate NaN rows all the way through the pipeline so that .predict or .predict_proba don't fail but just slap some NaN values on the predictions at the end.
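As a sketch of the third bullet, the check in stacking_estimator.py could emit a descriptive warning before silently dropping the probability columns (warn_on_bad_proba is a hypothetical helper, not current TPOT code):

import warnings
import numpy as np

def warn_on_bad_proba(estimator, y_pred_proba):
    # hypothetical helper: explain *why* the stacked probabilities were dropped
    if not np.all(np.isfinite(y_pred_proba)):
        warnings.warn(
            f"{type(estimator).__name__} returned NaN/inf from predict_proba inside "
            "StackingEstimator.transform; its probability columns will be dropped, "
            "so downstream steps will see fewer features than they saw during fit.",
            UserWarning,
        )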

I'm not an expert so I'm not sure which of these is best.

pjvpjv commented Oct 6, 2022