
PermutationImportance error with XGBoost and NaNs - `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` (with a fix)

Open · ianozsvald opened this issue · 6 comments

With the current versions of XGBoost and eli5, if I add NaN values to X, show_weights works fine but PermutationImportance throws an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

To recreate:

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance

%load_ext watermark
%watermark -d -m -v -p numpy,sklearn,eli5,xgboost,pandas
#2018-05-03 
#CPython 3.6.5
#IPython 6.3.1
#numpy 1.14.2
#sklearn 0.19.1
#eli5 0.8
#xgboost 0.71
#pandas 0.22.0
#compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
#system     : Linux
#release    : 4.9.91-040991-generic
#machine    : x86_64
#processor  : x86_64
#CPU cores  : 8
#interpreter: 64bit

# 10 items of data: pairs of (useless feature, predictive feature)
X_np = np.array([[np.nan, 1,], [0, 1], [0, 1], [0, 1], [0, 1], [0, 2,], [0, 2,], [0, 2,], [0, 2,], [0, 2]])
y_np = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# with only the 10 items prepared above, XGBClassifier won't fit (but RandomForestClassifier does),
# so the score is 0. If we concatenate to make "more data" (30 items in total) then XGBClassifier
# fits with 100% accuracy
X = np.concatenate((X_np, X_np, X_np))
y = np.concatenate((y_np, y_np, y_np))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)

est = XGBClassifier()
est.fit(X_train, y_train)
print("Classifier score (should be 1.0):", est.score(X_test, y_test))

perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
#X y shapes: (15, 2) (15,) (15, 2) (15,)
#Classifier score (should be 1.0): 1.0

~/anaconda3/envs/debug_xgb_pandas_eli5/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The call to check_array uses sklearn's default constraints, which disallow NaN, while XGBoost is fine with NaN. My modification (monkey-patched here for easy testing) is to call check_array(X, force_all_finite=False):

from sklearn.base import clone  # type: ignore
from sklearn.metrics.scorer import check_scoring  # type: ignore
from sklearn.utils import check_array, check_random_state  # type: ignore

def fit(self, X, y, groups=None, **fit_params):
    # type: (...) -> PermutationImportance
    """Compute ``feature_importances_`` attribute and optionally
    fit the base estimator.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The training input samples.

    y : array-like, shape (n_samples,)
        The target values (integers that correspond to classes in
        classification, real numbers in regression).

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.

    **fit_params : Other estimator specific parameters

    Returns
    -------
    self : object
        Returns self.
    """
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)

    if self.cv != "prefit" and self.refit:
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, **fit_params)

    X = check_array(X, force_all_finite=False)  # originally: X = check_array(X)

    if self.cv not in (None, "prefit"):
        si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
    else:
        si = self._non_cv_scores_importances(X, y)
    scores, results = si
    self.scores_ = np.array(scores)
    self.results_ = results
    self.feature_importances_ = np.mean(results, axis=0)
    self.feature_importances_std_ = np.std(results, axis=0)
    return self

PermutationImportance.fit = fit
perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
# no errors, reports perm results just fine

It might be wise to test whether the estimator is XGBoost rather than plain sklearn, and flip force_all_finite accordingly, so the sklearn interpretation is preserved?
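
A rough sketch of what that check might look like (this is not eli5 code; the helper names are made up and module-name sniffing is just one possible heuristic):

from sklearn.utils import check_array

def _estimator_handles_nan(estimator):
    # Heuristic: XGBoost (and LightGBM) estimators handle NaN natively,
    # so the strict finiteness check could be skipped for them.
    module = type(estimator).__module__
    return module.startswith("xgboost") or module.startswith("lightgbm")

def _validate_X(estimator, X):
    # Keep sklearn's default validation unless the estimator tolerates NaN.
    return check_array(X, force_all_finite=not _estimator_handles_nan(estimator))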

ianozsvald · May 03 '18 21:05

> It might be wise to test whether the estimator is XGBoost rather than plain sklearn, and flip force_all_finite accordingly, so the sklearn interpretation is preserved?

If you go down that route, you should also check whether the model is a pipeline that already contains an imputer.
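
A minimal sketch of that extra check, assuming a scikit-learn Pipeline whose steps may include an imputer (SimpleImputer in sklearn >= 0.20, Imputer before that); the helper name is illustrative:

from sklearn.pipeline import Pipeline
try:
    from sklearn.impute import SimpleImputer as _Imputer  # sklearn >= 0.20
except ImportError:
    from sklearn.preprocessing import Imputer as _Imputer  # older sklearn

def _pipeline_has_imputer(estimator):
    # If the model is a Pipeline with an imputer step, NaNs are filled
    # before they reach the final estimator, so strict validation is safe.
    if not isinstance(estimator, Pipeline):
        return False
    return any(isinstance(step, _Imputer) for _, step in estimator.steps)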

lstmemery · May 04 '18 17:05

Or maybe the easy first step is to add an argument to PermutationImportance that lets the caller set this flag to True or False?
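
Expressed as a small subclass rather than a change to eli5 itself, the idea might look roughly like this; the force_all_finite keyword is hypothetical and is simply forwarded to check_array, and the fit body just mirrors the monkey-patched version above (check_scoring lives in sklearn.metrics.scorer in sklearn 0.19, sklearn.metrics later):

import numpy as np
from sklearn.base import clone
from sklearn.metrics.scorer import check_scoring
from sklearn.utils import check_array
from eli5.sklearn import PermutationImportance

class PermutationImportanceWithNaN(PermutationImportance):
    # Hypothetical API: expose the validation flag as a constructor argument.
    def __init__(self, estimator, force_all_finite=True, **kwargs):
        super().__init__(estimator, **kwargs)
        self.force_all_finite = force_all_finite

    def fit(self, X, y, groups=None, **fit_params):
        self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)
        if self.cv != "prefit" and self.refit:
            self.estimator_ = clone(self.estimator)
            self.estimator_.fit(X, y, **fit_params)
        # Forward the flag instead of hardcoding sklearn's strict default.
        X = check_array(X, force_all_finite=self.force_all_finite)
        if self.cv not in (None, "prefit"):
            si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
        else:
            si = self._non_cv_scores_importances(X, y)
        scores, results = si
        self.scores_ = np.array(scores)
        self.results_ = results
        self.feature_importances_ = np.mean(results, axis=0)
        self.feature_importances_std_ = np.std(results, axis=0)
        return self

# reusing est, X_test, y_test from the repro above
perm = PermutationImportanceWithNaN(est, force_all_finite=False)
perm.fit(X_test, y_test)
eli5.show_weights(perm)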

ianozsvald · May 04 '18 17:05

I'll note that even with a freshly installed conda environment I still get the issue above, and the work-around I posted still fixes it. These are my versions, via watermark:

2018-08-12 

CPython 3.6.6
IPython 6.5.0

numpy 1.15.0
matplotlib 2.2.2
sklearn 0.19.1
xgboost 0.72.1
seaborn 0.9.0
pandas 0.23.4
eli5 0.8

ianozsvald · Aug 12 '18 17:08

I have exactly the same problem. Is there any fix or solution?

stefansimik · Dec 08 '18 17:12

I'm looking at this as well, as I'm having the same issue.

I don't understand the case against making check_array(X, force_all_finite=False) the hardcoded default.

It's pretty obvious that the data provided to the model has to be similar to what the model was trained on, so I don't see why we need to do any input validation here.

I can make a PR but I'd like to hear thoughts from a contributor on this.

ihopethiswillfi · Dec 14 '18 17:12

I will pick up the issue

Matgrb · May 22 '20 08:05