
Data validation error when using Buckingham's Pi Theorem on Classification task

Open · aclemente-bigml opened this issue 4 years ago · 1 comment

Hi! While trying to use AutoFeatClassifier with units, I stumbled upon a validation error caused by an infinite value. Presumably one of the generated features (I assume one of those coming from the Pi Theorem) has an infinite value, which breaks the StandardScaler used while filtering correlated features. This is how I am calling the classifier, fitting training data that comes as a numpy ndarray:

auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)

These are the features logged for the Pi Theorem; all of them include divisions, which could lead to a division-by-zero issue.

...
[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1:  x002 / x001
[AutoFeat] Pi Theorem 2:  x006 / x000
[AutoFeat] Pi Theorem 3:  x010 / x005
[AutoFeat] Pi Theorem 4:  x003 / x001
[AutoFeat] Pi Theorem 5:  x013 / x001
[AutoFeat] Pi Theorem 6:  x014 / x005
[AutoFeat] Pi Theorem 7:  x000 * x005 * x012 / x015
[AutoFeat] Pi Theorem 8:  x016 / x000
[AutoFeat] Pi Theorem 9:  x017 / x012
...
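To illustrate why these ratio features are suspect: in pandas/numpy, dividing a float column by a column containing a zero does not raise, it silently yields inf. A minimal sketch with toy columns standing in for the logged features (the names x001/x002 are just placeholders for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two of the Pi Theorem inputs, e.g. x002 / x001.
# A zero in the denominator silently produces inf instead of raising.
x001 = pd.Series([2.0, 0.0, 4.0])
x002 = pd.Series([1.0, 3.0, 8.0])

ratio = x002 / x001
print(ratio.tolist())  # [0.5, inf, 2.0]
```

Any such inf then propagates through later feature-engineering steps until sklearn's input validation rejects the matrix.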

The full log output by a failing run is the following:

[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1:  x002 / x001
[AutoFeat] Pi Theorem 2:  x006 / x000
[AutoFeat] Pi Theorem 3:  x007 / x005
[AutoFeat] Pi Theorem 4:  x003 / x001
[AutoFeat] Pi Theorem 5:  x009 / x001
[AutoFeat] Pi Theorem 6:  x010 / x005
[AutoFeat] Pi Theorem 7:  x000 * x005 * x008 / x011
[AutoFeat] Pi Theorem 8:  x012 / x000
[AutoFeat] Pi Theorem 9:  x013 / x008
[AutoFeat] The 3 step feature engineering process could generate up to 118923 features.
[AutoFeat] With 121 data points this new feature matrix would use about 0.06 gb of space.
[feateng] Step 1: transformation of original features
[feateng] Generated 40 transformed features from 14 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 1524 feature combinations from 1431 original feature tuples - done.
[feateng] Step 3: transformation of new features
[feateng] Generated 4564 transformed features from 1524 original features - done.
[feateng] Generated altogether 6233 new features in 3 steps
[feateng] Removing correlated features, as well as additions at the highest level

And after that, the error is reported with the following stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-323-53dcdfc1b68e> in <module>
     32     # categorical_cols = []
     33     auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
---> 34     X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)
     35     X_test_new = auto.transform(X_test.to_numpy())
     36     pretty_names = feature_names(auto, USEFUL_ACTUALS)

~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/autofeat.py in fit_transform(self, X, y)
    299         # generate features
    300         df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
--> 301                                                             self.feateng_steps, self.transformations, self.verbose)
    302         # select predictive features
    303         if self.featsel_runs <= 0:

~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
    354     if cols:
    355         # check for correlated features again; this time with the start features
--> 356         corrs = dict(zip(cols, np.max(np.abs(np.dot(StandardScaler().fit_transform(df[cols]).T, StandardScaler().fit_transform(df_org))/df_org.shape[0]), axis=1)))
    357         cols = [c for c in cols if corrs[c] < 0.9]
    358     cols = list(df_org.columns) + cols

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    688         if y is None:
    689             # fit method of arity 1 (unsupervised transformation)
--> 690             return self.fit(X, **fit_params).transform(X)
    691         else:
    692             # fit method of arity 2 (supervised transformation)

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    665         # Reset internal state before fitting
    666         self._reset()
--> 667         return self.partial_fit(X, y)
    668 
    669     def partial_fit(self, X, y=None):

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    696         X = self._validate_data(X, accept_sparse=('csr', 'csc'),
    697                                 estimator=self, dtype=FLOAT_DTYPES,
--> 698                                 force_all_finite='allow-nan')
    699 
    700         # Even in the case of `with_mean=False`, we update the mean anyway

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    418                     f"requires y to be passed, but the target y is None."
    419                 )
--> 420             X = check_array(X, **check_params)
    421             out = X
    422         else:

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    643         if force_all_finite:
    644             _assert_all_finite(array,
--> 645                                allow_nan=force_all_finite == 'allow-nan')
    646 
    647     if ensure_min_samples > 0:

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     97                     msg_err.format
     98                     (type_err,
---> 99                      msg_dtype if msg_dtype is not None else X.dtype)
    100             )
    101     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains infinity or a value too large for dtype('float64').

I tried removing all constant features from the original dataset, so that every original feature has std() > 0. It looks like some generated feature performs a division by zero deep in the feature engineering, producing an infinite value. Maybe there should be some handling there, either dropping the offending feature or replacing the infinities with NaN, which the scalers know to ignore?
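As a sketch of the NaN suggestion (not the library's actual fix, just a workaround I'd expect to pass validation): the traceback shows the data is checked with force_all_finite='allow-nan', and StandardScaler since sklearn 0.20 computes statistics with NaN-aware functions, so mapping inf to NaN before scaling avoids the ValueError:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with a division-by-zero artifact in one cell.
df = pd.DataFrame({
    "x002_div_x001": [1.0, 2.0, np.inf, 4.0],  # ratio feature gone infinite
    "x003":          [0.5, 1.5, 2.5, 4.5],
})

# StandardScaler rejects +/-inf even under 'allow-nan', but it skips NaNs,
# so replacing infinities with NaN sidesteps the validation error.
df_clean = df.replace([np.inf, -np.inf], np.nan)
scaled = StandardScaler().fit_transform(df_clean)
```

The NaN cell stays NaN after scaling, while the remaining rows are standardized using the finite values only.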

aclemente-bigml avatar Jan 29 '21 17:01 aclemente-bigml

Thanks for flagging this - I'll have a look & add some extra checks!

cod3licious avatar Jan 29 '21 19:01 cod3licious