scikit-learn-intelex
ValueError: "Unknown label type: 'unknown'" when class column has Pandas nullable type like Int64
Describe the bug
Accidentally first posted at https://github.com/scikit-learn/scikit-learn as issue https://github.com/scikit-learn/scikit-learn/issues/25953.
I often use Pandas to load data from CSV and transform it. Pandas tends to parse integer columns as a floating point type, so I usually use df = df.convert_dtypes() to bring those columns back to an integer type.
By design (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html), Pandas convert_dtypes() converts all output columns to the corresponding nullable extension types (such as Float64), not "simple" types (such as float64).
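The dtype change described above can be checked directly. This is a minimal sketch that reuses the data frame from the reproducer below; the variable names before and after are only illustrative.

```python
import pandas as pd

# Same data as in the reproducer for this issue.
df = pd.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Column dtypes as strings, before and after convert_dtypes().
before = df.dtypes.astype(str).to_dict()
after = df.convert_dtypes().dtypes.astype(str).to_dict()

print(before)  # NumPy-backed dtypes (lowercase names)
print(after)   # nullable extension dtypes (capitalized names)
```

The capitalized names are the nullable extension types; the lowercase ones are the plain NumPy dtypes.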
When I try to train some Scikit-Learn models like RandomForestClassifier on such data, I get the error ValueError: Unknown label type: 'unknown'.
This bug is related to the already fixed https://github.com/scikit-learn/scikit-learn/issues/25953.
Steps/Code to Reproduce
from sklearnex import patch_sklearn
patch_sklearn()
import sklearn.ensemble
import pandas
df = pandas.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})
df = df.convert_dtypes()
model = sklearn.ensemble.RandomForestClassifier()
model.fit(
    X=df.drop(columns="class"),
    y=df["class"],
)
Expected Results
No error is thrown
Actual Results
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[96], line 7
5 df = df.convert_dtypes()
6 model = sklearn.ensemble.RandomForestClassifier()
----> 7 model.fit(
8 X=df.drop(columns="class"),
9 y=df["class"],
10 )
File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/_device_offload.py:88, in support_usm_ndarray.<locals>.decorator.<locals>.wrapper_with_self(self, *args, **kwargs)
86 @wraps(func)
87 def wrapper_with_self(self, *args, **kwargs):
---> 88 return wrapper_impl(self, *args, **kwargs)
File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/_device_offload.py:74, in support_usm_ndarray.<locals>.decorator.<locals>.wrapper_impl(obj, *args, **kwargs)
72 usm_iface = _extract_usm_iface(*args, **kwargs)
73 q, hostargs, hostkwargs = _get_host_inputs(*args, **kwargs)
---> 74 result = _run_on_device(func, q, obj, *hostargs, **hostkwargs)
75 if usm_iface is not None and hasattr(result, '__array_interface__'):
76 return _copy_to_usm(q, result)
File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/_device_offload.py:65, in _run_on_device(func, queue, obj, *args, **kwargs)
62 with sycl_context('gpu' if queue.sycl_device.is_gpu else 'cpu',
63 host_offload_on_fail=host_offload):
64 return dispatch_by_obj(obj, func, *args, **kwargs)
---> 65 return dispatch_by_obj(obj, func, *args, **kwargs)
File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/_device_offload.py:53, in _run_on_device.<locals>.dispatch_by_obj(obj, func, *args, **kwargs)
51 def dispatch_by_obj(obj, func, *args, **kwargs):
52 if obj is not None:
---> 53 return func(obj, *args, **kwargs)
54 return func(*args, **kwargs)
File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/ensemble/_forest.py:697, in RandomForestClassifier.fit(self, X, y, sample_weight)
670 @support_usm_ndarray()
671 def fit(self, X, y, sample_weight=None):
672 """
673 Build a forest of trees from the training set (X, y).
674
(...)
695 self : object
696 """
--> 697 return _fit_classifier(self, X, y, sample_weight=sample_weight)
File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/ensemble/_forest.py:361, in _fit_classifier(self, X, y, sample_weight)
359 _patching_status.write_log()
360 if _dal_ready:
--> 361 _daal_fit_classifier(self, X, y, sample_weight=sample_weight)
363 self.estimators_ = self._estimators_
365 # Decapsulate classes_ attributes
File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/ensemble/_forest.py:178, in _daal_fit_classifier(self, X, y, sample_weight)
176 def _daal_fit_classifier(self, X, y, sample_weight=None):
177 y = check_array(y, ensure_2d=False, dtype=None)
--> 178 y, expanded_class_weight = self._validate_y_class_weight(y)
179 n_classes_ = self.n_classes_[0]
180 self.n_features_in_ = X.shape[1]
File ~/.conda/envs/dis/lib/python3.11/site-packages/sklearn/ensemble/_forest.py:746, in ForestClassifier._validate_y_class_weight(self, y)
745 def _validate_y_class_weight(self, y):
--> 746 check_classification_targets(y)
748 y = np.copy(y)
749 expanded_class_weight = None
File ~/.conda/envs/dis/lib/python3.11/site-packages/sklearn/utils/multiclass.py:218, in check_classification_targets(y)
210 y_type = type_of_target(y, input_name="y")
211 if y_type not in [
212 "binary",
213 "multiclass",
(...)
216 "multilabel-sequences",
217 ]:
--> 218 raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'unknown'
Versions
System:
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
executable: /lyceum/sm4u19/.conda/envs/dis/bin/python
machine: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17
Python dependencies:
sklearn: 1.2.2
pip: 23.0.1
setuptools: 67.6.0
numpy: 1.23.5
scipy: 1.10.1
Cython: None
pandas: 1.5.3
matplotlib: 3.7.1
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: mkl
prefix: libmkl_rt
filepath: /mainfs/scratch/sm4u19/.conda/envs/dis/lib/libmkl_rt.so.1
version: 2021.4-Product
threading_layer: intel
num_threads: 16
user_api: openmp
internal_api: openmp
prefix: libomp
filepath: /mainfs/scratch/sm4u19/.conda/envs/dis/lib/libomp.so
version: None
num_threads: 16
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /mainfs/scratch/sm4u19/.conda/envs/dis/lib/libgomp.so.1.0.0
version: None
num_threads: 16
This bug is connected to outdated input validation. I will look into how best to update it.
It was solved by changes associated with sklearn 1.2; analysis is ongoing.
There seems to be an issue with sklearn's validate_data when multi_output=False. It is not setting y to the proper dtype. A fix is underway in this PR: #1939
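Until a fix lands, one possible workaround is to cast nullable columns back to plain NumPy dtypes before calling fit. The sketch below is not from this thread: to_numpy_backed is a hypothetical helper, and it assumes the nullable columns hold integers or floats with no missing values.

```python
import pandas as pd


def to_numpy_backed(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cast nullable Int/Float columns to NumPy dtypes."""
    out = df.copy()
    for name in df.columns:
        col = df[name]
        if not pd.api.types.is_extension_array_dtype(col.dtype):
            continue  # already NumPy-backed, nothing to do
        if col.isna().any():
            # Plain int64 cannot represent missing values, so refuse to guess.
            raise ValueError(f"column {name!r} contains missing values")
        if pd.api.types.is_integer_dtype(col.dtype):
            out[name] = col.astype("int64")
        elif pd.api.types.is_float_dtype(col.dtype):
            out[name] = col.astype("float64")
        # Other extension dtypes (strings, categoricals) are left untouched.
    return out


df = pd.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})
plain = to_numpy_backed(df.convert_dtypes())
print(plain.dtypes)

# The result can then be passed to the estimator as in the reproducer, e.g.
# model.fit(X=plain.drop(columns="class"), y=plain["class"])
```

Casting only y (for example df["class"].astype("int64")) may be enough for the error reported here, since the traceback fails while validating the labels; casting every column is simply the more conservative choice.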