PermutationImportance error with XGBoost and NaNs - `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` (with a fix)
Using the current versions of XGBoost and ELI5, if I add NaN values to X, whilst show_weights works fine, PermutationImportance throws an error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
To recreate:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance
%load_ext watermark
%watermark -d -m -v -p numpy,sklearn,eli5,xgboost,pandas
#2018-05-03
#CPython 3.6.5
#IPython 6.3.1
#numpy 1.14.2
#sklearn 0.19.1
#eli5 0.8
#xgboost 0.71
#pandas 0.22.0
#compiler : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
#system : Linux
#release : 4.9.91-040991-generic
#machine : x86_64
#processor : x86_64
#CPU cores : 8
#interpreter: 64bit
# 10 items of data, pairs of (useless feature, predictive feature)
X_np = np.array([[np.nan, 1,], [0, 1], [0, 1], [0, 1], [0, 1], [0, 2,], [0, 2,], [0, 2,], [0, 2,], [0, 2]])
y_np = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# if we have 10 items (prepared above) XGBClassifier won't fit (but RandomForestClassifier does)
# so the score is 0. If we concatenate to make "more data" (30 items in total) then XGBClassifier
# fits with 100%
X = np.concatenate((X_np, X_np, X_np))
y = np.concatenate((y_np, y_np, y_np))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)
est = XGBClassifier()
est.fit(X_train, y_train)
print("Classifier score (should be 1.0):", est.score(X_test, y_test))
perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
#X y shapes: (15, 2) (15,) (15, 2) (15,)
#Classifier score (should be 1.0): 1.0
~/anaconda3/envs/debug_xgb_pandas_eli5/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
42 and not np.isfinite(X).all()):
43 raise ValueError("Input contains NaN, infinity"
---> 44 " or a value too large for %r." % X.dtype)
45
46
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The call to check_array is using sklearn's constraints and disallows NaN. XGBoost is ok with NaN. My modification (monkey patched here for easy testing) is to call check_array(X, force_all_finite=False):
from sklearn.metrics.scorer import check_scoring # type: ignore
from sklearn.base import clone # type: ignore  # needed for the refit branch below
from sklearn.utils import check_array, check_random_state # type: ignore
def fit(self, X, y, groups=None, **fit_params):
# type: (...) -> PermutationImportance
"""Compute ``feature_importances_`` attribute and optionally
fit the base estimator.
Parameters
----------
X : array-like of shape (n_samples, n_features)
The training input samples.
y : array-like, shape (n_samples,)
The target values (integers that correspond to classes in
classification, real numbers in regression).
groups : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into
train/test set.
**fit_params : Other estimator specific parameters
Returns
-------
self : object
Returns self.
"""
self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)
if self.cv != "prefit" and self.refit:
self.estimator_ = clone(self.estimator)
self.estimator_.fit(X, y, **fit_params)
X = check_array(X, force_all_finite=False)
#X = check_array(X)
if self.cv not in (None, "prefit"):
si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
else:
si = self._non_cv_scores_importances(X, y)
scores, results = si
self.scores_ = np.array(scores)
self.results_ = results
self.feature_importances_ = np.mean(results, axis=0)
self.feature_importances_std_ = np.std(results, axis=0)
return self
PermutationImportance.fit = fit
perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
# no errors, reports perm results just fine
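For reference, the behaviour difference comes straight from sklearn's check_array defaults; a minimal standalone sketch (separate from the snippet above) showing both modes:

import numpy as np
from sklearn.utils import check_array

X_nan = np.array([[np.nan, 1.0], [0.0, 1.0]])

try:
    check_array(X_nan)  # default force_all_finite=True rejects NaN
except ValueError as exc:
    print("default check_array:", exc)

# With force_all_finite=False the NaN is passed through untouched,
# leaving it to the estimator (here XGBoost) to handle missing values.
print(check_array(X_nan, force_all_finite=False))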
It might be wise to try testing for the use of XGB vs sklearn, and then force_all_finite could be flipped to preserve the sklearn interpretation?
If you go down that route, you should also check whether the model is a pipeline with an imputer.
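A rough sketch of what that dispatch could look like (the helper names here are illustrative only, not part of eli5):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer  # SimpleImputer in newer sklearn
from sklearn.utils import check_array

def _allows_nan(est):
    if isinstance(est, Pipeline):
        # A pipeline with an imputer step deals with NaN itself.
        if any(isinstance(step, Imputer) for _, step in est.steps):
            return True
        est = est.steps[-1][1]
    # Crude check: anything from the xgboost package handles NaN natively.
    return type(est).__module__.split(".")[0] == "xgboost"

def _validated(est, X):
    # Keep sklearn's strict validation unless the estimator tolerates NaN.
    return check_array(X, force_all_finite=not _allows_nan(est))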
Or maybe the easy first step is to pass an argument to PermutationImportance to set this flag True or False?
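A rough sketch of that option as a subclass rather than a patch to eli5 itself; the force_all_finite constructor argument is hypothetical and not part of the real PermutationImportance API, and the fit body simply mirrors the monkey-patched version above:

import numpy as np
from sklearn.base import clone
from sklearn.metrics.scorer import check_scoring
from sklearn.utils import check_array
from eli5.sklearn import PermutationImportance

class FlexiblePermutationImportance(PermutationImportance):
    # Hypothetical subclass: exposes the validation flag per instance.
    def __init__(self, estimator, force_all_finite=True, **kwargs):
        super().__init__(estimator, **kwargs)
        self.force_all_finite = force_all_finite

    def fit(self, X, y, groups=None, **fit_params):
        self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)
        if self.cv != "prefit" and self.refit:
            self.estimator_ = clone(self.estimator)
            self.estimator_.fit(X, y, **fit_params)
        # Only change from eli5's fit: the flag is configurable.
        X = check_array(X, force_all_finite=self.force_all_finite)
        if self.cv not in (None, "prefit"):
            si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
        else:
            si = self._non_cv_scores_importances(X, y)
        scores, results = si
        self.scores_ = np.array(scores)
        self.results_ = results
        self.feature_importances_ = np.mean(results, axis=0)
        self.feature_importances_std_ = np.std(results, axis=0)
        return self

perm = FlexiblePermutationImportance(est, force_all_finite=False)
perm.fit(X_test, y_test)
eli5.show_weights(perm)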
I'll note that with a fresh install of a conda environment I still get the above issue, and with the work-around I posted it works ok. These are my versions using watermark:
2018-08-12
CPython 3.6.6
IPython 6.5.0
numpy 1.15.0
matplotlib 2.2.2
sklearn 0.19.1
xgboost 0.72.1
seaborn 0.9.0
pandas 0.23.4
eli5 0.8
I have exactly the same problem. Is there any fix or workaround?
I'm looking at this as well, as I'm having the same issue.
I don't understand the case against having check_array(X, force_all_finite=False) as the default, hardcoded.
It's pretty obvious that the data provided to the model has to be similar to what the model was trained on, so I don't see why we need to do any input validation here.
I can make a PR, but I'd like to hear thoughts from a contributor on this.
I will pick up the issue.