Passing `return_indices=True` to `skrub.cross_validate` raises an error
Passing the option `return_indices=True` to `skrub.cross_validate` raises an error:
import skrub
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_a, y_a = make_classification(random_state=0)
X, y = skrub.X(X_a), skrub.y(y_a)
# Tune C over a log-uniform range
log_reg = LogisticRegression(
    C=skrub.choose_float(0.01, 1.0, log=True, name="C")
)
pred = X.skb.apply(log_reg, y=y)
search = pred.skb.get_randomized_search(random_state=0)
skrub.cross_validate(search, pred.skb.get_data(), return_indices=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 12
10 pred = X.skb.apply(log_reg, y=y)
11 search = pred.skb.get_randomized_search(random_state=0)
---> 12 skrub.cross_validate(search, pred.skb.get_data(), return_indices=True)
File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/skrub/_expressions/_estimator.py:616, in cross_validate(pipeline, environment, keep_subsampling, **kwargs)
614 if (fitted_pipelines := result.pop("estimator", None)) is not None:
615 result["pipeline"] = [_to_env_pipeline(p) for p in fitted_pipelines]
--> 616 return pd.DataFrame(result)
File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/pandas/core/frame.py:778, in DataFrame.__init__(self, data, index, columns, dtype, copy)
772 mgr = self._init_mgr(
773 data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
774 )
776 elif isinstance(data, dict):
777 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 778 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
779 elif isinstance(data, ma.MaskedArray):
780 from numpy.ma import mrecords
File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/pandas/core/internals/construction.py:503, in dict_to_mgr(data, index, columns, dtype, typ, copy)
499 else:
500 # dtype check to exclude e.g. range objects, scalars
501 arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 503 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/pandas/core/internals/construction.py:114, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
111 if verify_integrity:
112 # figure out the index, if necessary
113 if index is None:
--> 114 index = _extract_index(arrays)
115 else:
116 index = ensure_index(index)
File ~/Documents/teaching/forecasting/.pixi/envs/dev/lib/python3.13/site-packages/pandas/core/internals/construction.py:680, in _extract_index(data)
677 raise ValueError("All arrays must be of the same length")
679 if have_dicts:
--> 680 raise ValueError(
681 "Mixing dicts with non-Series may lead to ambiguous ordering."
682 )
684 if have_series:
685 if lengths[0] != len(index):
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
A wild guess is that scikit-learn stores the indices in a dictionary whose values are NumPy arrays. Since `skrub.cross_validate` returns a pandas DataFrame, storing such a dictionary as-is alongside the other result arrays may be what fails.
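To make the guess above concrete, here is a minimal sketch that uses only NumPy and pandas, not skrub. The `result` dict is hand-written to imitate the shape that scikit-learn's `cross_validate` is documented to return with `return_indices=True` (an `"indices"` entry that is itself a dict with `"train"` and `"test"` keys); the values are made up. The flattening at the end, and the column names `train_indices` and `test_indices`, are hypothetical and only one possible way to make the result fit in a DataFrame, not what skrub does.

```python
import numpy as np
import pandas as pd

# Hand-written stand-in for a 2-fold cross-validation result (made-up values).
result = {
    "fit_time": np.array([0.1, 0.2]),
    "test_score": np.array([0.9, 0.8]),
    "indices": {
        "train": [np.array([1, 2]), np.array([0, 2])],
        "test": [np.array([0]), np.array([1])],
    },
}

# Building a DataFrame from a dict that mixes array values and a dict value
# fails with the same message as in the traceback.
try:
    pd.DataFrame(result)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)

# Hypothetical workaround: replace the nested dict by one column per split,
# each cell holding the index array of one fold.
flat = {k: v for k, v in result.items() if k != "indices"}
flat["train_indices"] = result["indices"]["train"]
flat["test_indices"] = result["indices"]["test"]
df = pd.DataFrame(flat)
print(df.columns.tolist())
```

If this reproduces the error on your pandas version, it suggests the failure comes from the nested `"indices"` dict rather than from the arrays themselves.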
From IRL discussion:
Good to have, but should not be a blocker for 0.6.0. We can release a fix in 0.6.1
why close as not planned? we should at least have a better error message if we don't want to support return_indices
I messed up, I thought I was just removing it from the milestone
ah ok, thanks! gh interface can be a bit confusing 😅