skrub
Handle numerical missing values in TableVectorizer
Problem Description
Missing values are not handled by default.
A reproducer:
from skrub import TableVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml
X_df, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_df, y)
model = make_pipeline(TableVectorizer(), RandomForestClassifier())
model.fit(X_train, y_train).score(X_test, y_test)
Gives
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 12
9 X_train, X_test, y_train, y_test = train_test_split(X_df, y)
11 model = make_pipeline(TableVectorizer(), RandomForestClassifier())
---> 12 model.fit(X_train, y_train).score(X_test, y_test)
File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/pipeline.py:405, in Pipeline.fit(self, X, y, **fit_params)
403 if self._final_estimator != "passthrough":
404 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 405 self._final_estimator.fit(Xt, y, **fit_params_last_step)
407 return self
File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/ensemble/_forest.py:345, in BaseForest.fit(self, X, y, sample_weight)
343 if issparse(y):
344 raise ValueError("sparse multilabel-indicator for y is not supported.")
--> 345 X, y = self._validate_data(
346 X, y, multi_output=True, accept_sparse="csc", dtype=DTYPE
347 )
348 if sample_weight is not None:
349 sample_weight = _check_sample_weight(sample_weight, X)
File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/base.py:584, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
582 y = check_array(y, input_name="y", **check_y_params)
583 else:
--> 584 X, y = check_X_y(X, y, **check_params)
585 out = X, y
587 if not no_val_X and check_params.get("ensure_2d", True):
File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/utils/validation.py:1106, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1101 estimator_name = _check_estimator_name(estimator)
1102 raise ValueError(
1103 f"{estimator_name} requires y to be passed, but the target y is None"
1104 )
-> 1106 X = check_array(
1107 X,
1108 accept_sparse=accept_sparse,
1109 accept_large_sparse=accept_large_sparse,
1110 dtype=dtype,
1111 order=order,
1112 copy=copy,
1113 force_all_finite=force_all_finite,
1114 ensure_2d=ensure_2d,
1115 allow_nd=allow_nd,
1116 ensure_min_samples=ensure_min_samples,
1117 ensure_min_features=ensure_min_features,
1118 estimator=estimator,
1119 input_name="X",
1120 )
1122 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1124 check_consistent_length(X, y)
File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
915 raise ValueError(
916 "Found array with dim %d. %s expected <= 2."
917 % (array.ndim, estimator_name)
918 )
920 if force_all_finite:
--> 921 _assert_all_finite(
922 array,
923 input_name=input_name,
924 estimator_name=estimator_name,
925 allow_nan=force_all_finite == "allow-nan",
926 )
928 if ensure_min_samples > 0:
929 n_samples = _num_samples(array)
File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
144 if estimator_name and input_name == "X" and has_nan_error:
145 # Improve the error message on how to handle missing values in
146 # scikit-learn.
147 msg_err += (
148 f"\n{estimator_name} does not accept missing values"
149 " encoded as NaN natively. For supervised learning, you might want"
(...)
159 "#estimators-that-handle-nan-values"
160 )
--> 161 raise ValueError(msg_err)
ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
Feature Description
If there is a missing value in a numeric column, use SimpleImputer.
Alternative Solutions
No response
Additional Context
No response
This is not a bug in TableVectorizer: it's down to the learner to handle missing values (because the strategy for handling missing values must differ depending on the learner).
If the learner does not handle missing values, you should add an imputer (as you did).
In addition, random forests will handle missing values natively in an upcoming release of scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/5870 So your specific problem will disappear very soon.
However, we recommend using HistGradientBoosting over RandomForest: it often works better.
Still, the goal of the TableVectorizer is to prepare a table so that the rest of the pipeline works on it without problems. Quite a few estimators lack support for missing values, and missing values are ubiquitous, so it is worth trying to find ways to improve the user experience. I would suggest keeping the issue open for discussion.
But I agree that at least the default should probably be to output NaNs where there are missing values, as is currently the case.
I agree that this depends on the downstream classifier, but I think having an option to "fill missing values" would be a nice feature, as the goal of TableVectorizer is to take a table and "vectorize" it. (That is why this is a feature request and not a bug ;) )
I disagree with your desire to have an option to do it automatically: there is no good default, and it tends to depend a lot on the downstream estimator.
If you really want good behavior by default, you should really use HistGradientBoosting, which is very robust to many things.
And, besides, it's not very hard to write:
make_pipeline(TableVectorizer(), SimpleImputer(), RandomForestClassifier())
Not much more difficult than:
make_pipeline(TableVectorizer(), RandomForestClassifier())
Yes, it is easy to fix (I used numerical_transformer=SimpleImputer), but I found this behavior unexpected, as I thought (without reading the doc) that I would get a vector out of this Transformer.
I find the name confusing as this does not vectorize the table (to me, a vector should have a consistent type for all its entries).
This class only acts on the categories and not the numerical values, so maybe it would be better to call it CategoryVectorizer, to make it clear it does not touch the numerics.
My 2cts :)
This class only acts on the categories and not the numerical values so maybe it would be better to call it CategoryVectorizer, to make it clear it does not touch the numerics.
It also deals with dates.