
Passing `return_indices=True` to `skrub.cross_validate` raises an error

Open glemaitre opened this issue 5 months ago • 4 comments

Passing the option return_indices=True in skrub.cross_validate raises an error:

import skrub
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_a, y_a = make_classification(random_state=0)
X, y = skrub.X(X_a), skrub.y(y_a)
log_reg = LogisticRegression(
    C=skrub.choose_float(0.01, 1.0, log=True, name="C")
)
pred = X.skb.apply(log_reg, y=y)
search = pred.skb.get_randomized_search(random_state=0)
skrub.cross_validate(search, pred.skb.get_data(), return_indices=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 12
     10 pred = X.skb.apply(log_reg, y=y)
     11 search = pred.skb.get_randomized_search(random_state=0)
---> 12 skrub.cross_validate(search, pred.skb.get_data(), return_indices=True)

File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/skrub/_expressions/_estimator.py:616, in cross_validate(pipeline, environment, keep_subsampling, **kwargs)
    614 if (fitted_pipelines := result.pop("estimator", None)) is not None:
    615     result["pipeline"] = [_to_env_pipeline(p) for p in fitted_pipelines]
--> 616 return pd.DataFrame(result)

File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/pandas/core/frame.py:778, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    772     mgr = self._init_mgr(
    773         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    774     )
    776 elif isinstance(data, dict):
    777     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 778     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    779 elif isinstance(data, ma.MaskedArray):
    780     from numpy.ma import mrecords

File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/pandas/core/internals/construction.py:503, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    499     else:
    500         # dtype check to exclude e.g. range objects, scalars
    501         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 503 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/pandas/core/internals/construction.py:114, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    111 if verify_integrity:
    112     # figure out the index, if necessary
    113     if index is None:
--> 114         index = _extract_index(arrays)
    115     else:
    116         index = ensure_index(index)

File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/pandas/core/internals/construction.py:680, in _extract_index(data)
    677     raise ValueError("All arrays must be of the same length")
    679 if have_dicts:
--> 680     raise ValueError(
    681         "Mixing dicts with non-Series may lead to ambiguous ordering."
    682     )
    684 if have_series:
    685     if lengths[0] != len(index):

ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

A wild guess is that scikit-learn stores those in a dictionary and the indices are NumPy arrays. Since skrub.cross_validate returns a pandas DataFrame, there may be an issue with storing such arrays as-is.

glemaitre avatar Jul 05 '25 22:07 glemaitre

From IRL discussion:

Good to have, but should not be a blocker for 0.6.0. We can release a fix in 0.6.1

rcap107 avatar Jul 15 '25 09:07 rcap107

why close as not planned? we should at least have a better error message if we don't want to support return_indices

jeromedockes avatar Nov 03 '25 09:11 jeromedockes

I messed up, I thought I was just removing it from the milestone

rcap107 avatar Nov 03 '25 10:11 rcap107

ah ok, thanks! gh interface can be a bit confusing 😅

jeromedockes avatar Nov 03 '25 10:11 jeromedockes