scikit-learn-intelex

ValueError: "Unknown label type: 'unknown'" when class column has Pandas nullable type like Int64

Open smith558 opened this issue 1 year ago • 1 comments

Describe the bug

I accidentally posted this first at https://github.com/scikit-learn/scikit-learn as issue https://github.com/scikit-learn/scikit-learn/issues/25953.

I often use Pandas to load data from CSV and transform it. Pandas tends to parse integer columns as floating point type, so I usually use df = df.convert_dtypes() to bring those columns back to an integer type.

By design (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html), Pandas convert_dtypes() converts all output columns to the corresponding nullable extension types (such as Float64), not "simple" types (such as float64).
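As a minimal sketch of the dtype change described above, the snippet below compares column dtypes before and after convert_dtypes(). It uses the same toy frame as the reproduction in this report; the variable names before and after are only illustrative.

```python
import pandas as pd

# Same toy frame as in the reproduction steps of this report.
df = pd.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Map each column name to the string name of its dtype.
before = df.dtypes.astype(str).to_dict()
after = df.convert_dtypes().dtypes.astype(str).to_dict()

print(before)  # NumPy-backed dtypes (lowercase names)
print(after)   # nullable extension dtypes (capitalized names)
```

The capitalized names (Int64, Float64) are the nullable extension types; the lowercase ones (int64, float64) are the plain NumPy dtypes.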

When I try to train some Scikit-Learn models like RandomForestClassifier on such data I get the error ValueError: Unknown label type: 'unknown'.

This bug is related to the already fixed issue https://github.com/scikit-learn/scikit-learn/issues/25953.

Steps/Code to Reproduce

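from sklearnex import patch_sklearn
patch_sklearn()  # patch before importing scikit-learn estimators
import sklearn.ensemble  # import the submodule explicitly; `import sklearn` alone may not load it
import pandas

df = pandas.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})
df = df.convert_dtypes()  # columns become nullable Int64 / Float64
model = sklearn.ensemble.RandomForestClassifier()
model.fit(
    X=df.drop(columns="class"),
    y=df["class"],
)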

Expected Results

No error is thrown

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[96], line 7
      5 df = df.convert_dtypes()
      6 model = sklearn.ensemble.RandomForestClassifier()
----> 7 model.fit(
      8     X=df.drop(columns="class"),
      9     y=df["class"],
     10 )

File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/_device_offload.py:88, in support_usm_ndarray.<locals>.decorator.<locals>.wrapper_with_self(self, *args, **kwargs)
     86 @wraps(func)
     87 def wrapper_with_self(self, *args, **kwargs):
---> 88     return wrapper_impl(self, *args, **kwargs)

File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/_device_offload.py:74, in support_usm_ndarray.<locals>.decorator.<locals>.wrapper_impl(obj, *args, **kwargs)
     72 usm_iface = _extract_usm_iface(*args, **kwargs)
     73 q, hostargs, hostkwargs = _get_host_inputs(*args, **kwargs)
---> 74 result = _run_on_device(func, q, obj, *hostargs, **hostkwargs)
     75 if usm_iface is not None and hasattr(result, '__array_interface__'):
     76     return _copy_to_usm(q, result)

File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/_device_offload.py:65, in _run_on_device(func, queue, obj, *args, **kwargs)
     62         with sycl_context('gpu' if queue.sycl_device.is_gpu else 'cpu',
     63                           host_offload_on_fail=host_offload):
     64             return dispatch_by_obj(obj, func, *args, **kwargs)
---> 65 return dispatch_by_obj(obj, func, *args, **kwargs)

File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/_device_offload.py:53, in _run_on_device.<locals>.dispatch_by_obj(obj, func, *args, **kwargs)
     51 def dispatch_by_obj(obj, func, *args, **kwargs):
     52     if obj is not None:
---> 53         return func(obj, *args, **kwargs)
     54     return func(*args, **kwargs)

File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/ensemble/_forest.py:697, in RandomForestClassifier.fit(self, X, y, sample_weight)
    670 @support_usm_ndarray()
    671 def fit(self, X, y, sample_weight=None):
    672     """
    673     Build a forest of trees from the training set (X, y).
    674 
   (...)
    695     self : object
    696     """
--> 697     return _fit_classifier(self, X, y, sample_weight=sample_weight)

File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/ensemble/_forest.py:361, in _fit_classifier(self, X, y, sample_weight)
    359 _patching_status.write_log()
    360 if _dal_ready:
--> 361     _daal_fit_classifier(self, X, y, sample_weight=sample_weight)
    363     self.estimators_ = self._estimators_
    365     # Decapsulate classes_ attributes

File ~/.conda/envs/dis/lib/python3.11/site-packages/daal4py/sklearn/ensemble/_forest.py:178, in _daal_fit_classifier(self, X, y, sample_weight)
    176 def _daal_fit_classifier(self, X, y, sample_weight=None):
    177     y = check_array(y, ensure_2d=False, dtype=None)
--> 178     y, expanded_class_weight = self._validate_y_class_weight(y)
    179     n_classes_ = self.n_classes_[0]
    180     self.n_features_in_ = X.shape[1]

File ~/.conda/envs/dis/lib/python3.11/site-packages/sklearn/ensemble/_forest.py:746, in ForestClassifier._validate_y_class_weight(self, y)
    745 def _validate_y_class_weight(self, y):
--> 746     check_classification_targets(y)
    748     y = np.copy(y)
    749     expanded_class_weight = None

File ~/.conda/envs/dis/lib/python3.11/site-packages/sklearn/utils/multiclass.py:218, in check_classification_targets(y)
    210 y_type = type_of_target(y, input_name="y")
    211 if y_type not in [
    212     "binary",
    213     "multiclass",
   (...)
    216     "multilabel-sequences",
    217 ]:
--> 218     raise ValueError("Unknown label type: %r" % y_type)

ValueError: Unknown label type: 'unknown'
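The last frames show type_of_target returning 'unknown' for y. One input that is known to produce that result is a NumPy array of integers stored with object dtype. Whether y has exactly that form at this point in the patched fit is an assumption, not something the traceback confirms. A small sketch of the difference:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

labels = [0, 1, 0, 1, 1]

# Plain integer array: recognised as a binary classification target.
as_int = type_of_target(np.array(labels, dtype="int64"))

# Same values stored as generic Python objects: not recognised.
as_object = type_of_target(np.array(labels, dtype=object))

print(as_int, as_object)
```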

Versions

System:
    python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
executable: /lyceum/sm4u19/.conda/envs/dis/bin/python
   machine: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17

Python dependencies:
      sklearn: 1.2.2
          pip: 23.0.1
   setuptools: 67.6.0
        numpy: 1.23.5
        scipy: 1.10.1
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.1
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
         prefix: libmkl_rt
       filepath: /mainfs/scratch/sm4u19/.conda/envs/dis/lib/libmkl_rt.so.1
        version: 2021.4-Product
threading_layer: intel
    num_threads: 16

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /mainfs/scratch/sm4u19/.conda/envs/dis/lib/libomp.so
        version: None
    num_threads: 16

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /mainfs/scratch/sm4u19/.conda/envs/dis/lib/libgomp.so.1.0.0
        version: None
    num_threads: 16

Mar 27 '23 10:03 smith558

This bug is connected to outdated input validation. I will look into the best way to update it.

Mar 28 '23 11:03 Alexsandruss

This was resolved by changes associated with sklearn 1.2; analysis is ongoing.

Jul 15 '24 12:07 icfaust

There seems to be an issue with sklearn's validate_data when multi_output=False. It is not setting y to the proper dtype. A fix is underway in this PR: #1939

Jul 16 '24 07:07 icfaust
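Until a fix lands, one possible workaround is to cast the nullable columns back to plain NumPy dtypes before calling fit. This is a sketch that is not confirmed anywhere in this thread; it assumes the data contains no missing values, and the names X and y are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})
df = df.convert_dtypes()  # nullable Int64 / Float64, as in the report

# Cast back to NumPy-backed dtypes. The int64 cast raises if the column
# contains pd.NA, because int64 cannot represent missing values.
X = df.drop(columns="class").astype("float64")
y = df["class"].astype("int64")

print(X.dtypes.astype(str).to_dict(), y.dtype)
```

X and y can then be passed to model.fit(X=X, y=y) in place of the nullable columns used in the reproduction.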