IbisML and GridSearchCV
I may have found a bug in how IbisML integrates with scikit-learn's grid search, but it could be that this is out of scope for the project.
import ibis_ml as iml
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

tfm = iml.Recipe(
    iml.Cast(iml.numeric(), "float32")
)
pipe = make_pipeline(tfm, GradientBoostingRegressor())
GridSearchCV(pipe, param_grid={}, cv=5).fit(X_train, y_train).cv_results_
This gave me this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File <timed exec>:7
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1466 estimator._validate_params()
1468 with config_context(
1469 skip_parameter_validation=(
1470 prefer_skip_nested_validation or global_skip_validation
1471 )
1472 ):
-> 1473 return fit_method(estimator, *args, **kwargs)
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/model_selection/_search.py:968, in BaseSearchCV.fit(self, X, y, **params)
962 results = self._format_results(
963 all_candidate_params, n_splits, all_out, all_more_results
964 )
966 return results
--> 968 self._run_search(evaluate_candidates)
970 # multimetric is determined here because in the case of a callable
971 # self.scoring the return type is only known after calling
972 first_test_score = all_out[0]["test_scores"]
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/model_selection/_search.py:1543, in GridSearchCV._run_search(self, evaluate_candidates)
1541 def _run_search(self, evaluate_candidates):
1542 """Search all candidates in param_grid"""
-> 1543 evaluate_candidates(ParameterGrid(self.param_grid))
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/model_selection/_search.py:914, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
906 if self.verbose > 0:
907 print(
908 "Fitting {0} folds for each of {1} candidates,"
909 " totalling {2} fits".format(
910 n_splits, n_candidates, n_candidates * n_splits
911 )
912 )
--> 914 out = parallel(
915 delayed(_fit_and_score)(
916 clone(base_estimator),
917 X,
918 y,
919 train=train,
920 test=test,
921 parameters=parameters,
922 split_progress=(split_idx, n_splits),
923 candidate_progress=(cand_idx, n_candidates),
924 **fit_and_score_kwargs,
925 )
926 for (cand_idx, parameters), (split_idx, (train, test)) in product(
927 enumerate(candidate_params),
928 enumerate(cv.split(X, y, **routed_params.splitter.split)),
929 )
930 )
932 if len(out) < 1:
933 raise ValueError(
934 "No fits were performed. "
935 "Was the CV iterator empty? "
936 "Were there no candidates?"
937 )
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/utils/parallel.py:67, in Parallel.__call__(self, iterable)
62 config = get_config()
63 iterable_with_config = (
64 (_with_config(delayed_func, config), args, kwargs)
65 for delayed_func, args, kwargs in iterable
66 )
---> 67 return super().__call__(iterable_with_config)
File ~/Development/probabl/venv/lib/python3.11/site-packages/joblib/parallel.py:1863, in Parallel.__call__(self, iterable)
1861 output = self._get_sequential_output(iterable)
1862 next(output)
-> 1863 return output if self.return_generator else list(output)
1865 # Let's create an ID that uniquely identifies the current call. If the
1866 # call is interrupted early and that the same instance is immediately
1867 # re-used, this id will be used to prevent workers that were
1868 # concurrently finalizing a task from the previous call to run the
1869 # callback.
1870 with self._lock:
File ~/Development/probabl/venv/lib/python3.11/site-packages/joblib/parallel.py:1792, in Parallel._get_sequential_output(self, iterable)
1790 self.n_dispatched_batches += 1
1791 self.n_dispatched_tasks += 1
-> 1792 res = func(*args, **kwargs)
1793 self.n_completed_tasks += 1
1794 self.print_progress()
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/utils/parallel.py:129, in _FuncWrapper.__call__(self, *args, **kwargs)
127 config = {}
128 with config_context(**config):
--> 129 return self.function(*args, **kwargs)
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:880, in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, score_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, split_progress, candidate_progress, error_score)
876 estimator = estimator.set_params(**clone(parameters, safe=False))
878 start_time = time.time()
--> 880 X_train, y_train = _safe_split(estimator, X, y, train)
881 X_test, y_test = _safe_split(estimator, X, y, test, train)
883 result = {}
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/utils/metaestimators.py:161, in _safe_split(estimator, X, y, indices, train_indices)
158 X_subset = _safe_indexing(X, indices)
160 if y is not None:
--> 161 y_subset = _safe_indexing(y, indices)
162 else:
163 y_subset = None
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/utils/_indexing.py:269, in _safe_indexing(X, indices, axis)
267 return _array_indexing(X, indices, indices_dtype, axis=axis)
268 else:
--> 269 return _list_indexing(X, indices, indices_dtype)
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/utils/_indexing.py:60, in _list_indexing(X, key, key_dtype)
58 return list(compress(X, key))
59 # key is a integer array-like of key
---> 60 return [X[idx] for idx in key]
File ~/Development/probabl/venv/lib/python3.11/site-packages/sklearn/utils/_indexing.py:60, in <listcomp>(.0)
58 return list(compress(X, key))
59 # key is a integer array-like of key
---> 60 return [X[idx] for idx in key]
File ~/Development/probabl/venv/lib/python3.11/site-packages/ibis/expr/types/generic.py:1374, in Column.__getitem__(self, _)
1373 def __getitem__(self, _):
-> 1374 raise TypeError(
1375 f"{self.__class__.__name__!r} is not subscriptable: "
1376 "see https://ibis-project.org/tutorial/ibis-for-pandas-users/#ibis-for-pandas-users for details."
1377 )
TypeError: 'IntegerColumn' is not subscriptable: see https://ibis-project.org/tutorial/ibis-for-pandas-users/#ibis-for-pandas-users for details.
It seems that the issue originates from the Ibis side of things, hence the ping. If this is out of scope for this project though, I will gladly hear it.
> If this is out of scope for this project though, I will gladly hear it.
Implementing CV isn't necessarily out of scope, but IbisML will probably need to reimplement it, because reproducible shuffling/splitting on databases is hard-ish. Here's the implementation of the much simpler `train_test_split`: https://github.com/jitingxu1/ibis-ml/blob/main/ibis_ml/utils/_split.py
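To illustrate the flavor of that approach, here is a minimal sketch of deterministic splitting on a backend (an illustration only, not the actual `ibis_ml/utils/_split.py` code; `table` and `unique_key` are placeholders): salt and hash a unique key, bucket the hash, and filter on the bucket.

```python
import ibis  # assumes `table` is an Ibis table with a unique key column


def hashed_train_test_split(table, unique_key, test_size=0.25, num_buckets=10_000, seed=42):
    # The bucket is a pure function of the key and the seed, so the same rows
    # land in the same split on every run and on any backend, with no shuffling
    # and no index arrays to materialize.
    bucket = (table[unique_key].cast("string") + f"-{seed}").hash().abs() % num_buckets
    threshold = int(test_size * num_buckets)
    return table.filter(bucket >= threshold), table.filter(bucket < threshold)
```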
CV was something we intentionally deprioritized for initial release, but will definitely take a second look!
Hi @koaning, thanks for your input.
The reasons we deprioritized it for our initial work are below; I copied my reply from https://github.com/ibis-project/ibis-ml/issues/135#issuecomment-2308956315:
- We could still run GridSearchCV over the model hyperparameters on IbisML's output, but the IbisML recipe should not be included in the sklearn pipeline, since IbisML has no CV splitter right now.
- IbisML does the feature transformation on a compute backend and then transfers the preprocessed data from the compute backend to the training process; the data movement cost will be significantly high when training on large datasets.
In my limited and outdated modeling experience, random or grid search may not be suitable for tuning some preprocessing transformations. Instead, preprocessing parameter tuning often requires feature analysis, for example, choosing between imputing a feature with the median or the mean. Do you have any insights here? We would like to learn more so we can see whether to prioritize this work.
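For concreteness, this is what the median-vs-mean example looks like as a tunable choice in plain scikit-learn (a generic sketch using `SimpleImputer`, not anything IbisML-specific):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# The imputation strategy is exposed as an ordinary pipeline hyperparameter,
# so grid search can compare mean vs. median imputation directly.
pipe = make_pipeline(SimpleImputer(), GradientBoostingRegressor())
grid = GridSearchCV(
    pipe,
    param_grid={"simpleimputer__strategy": ["mean", "median"]},
    cv=5,
)
```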
It can be pretty reasonable to ask questions like "what if we add this feature, how much uplift do we get?". If you ever want to do stuff like that ... it would sure help to be able to hyperparam that.
It's not unreasonable given the current design not to allow for this though. Feels like fair game considering that it is a different backend.
> It can be pretty reasonable to ask questions like "what if we add this feature, how much uplift do we get?". If you ever want to do stuff like that ... it would sure help to be able to hyperparam that.
I agree, this is a very fundamental part of the workflow.
> It's not unreasonable given the current design not to allow for this though. Feels like fair game considering that it is a different backend.
Part of the benefit of Ibis (and, by extension, IbisML) is that you can scale by choosing a more appropriate backend, instead of changing code. The way I see it (actually, the way I'm working on a large dataset myself), it would be a very reasonable user workflow to experiment locally on a smaller dataset using the DuckDB backend, then scale to the full dataset using a distributed backend. As a result, I want to be able to use IbisML to define my Recipe, use it in a scikit-learn pipeline, and tune hyperparameters on the smaller data.
I never planned to tune hyperparameters on the full, multi-TB dataset (what's the point?), but I do want to run the same Recipe/pipeline with the established hyperparameter values.
It makes sense to support this, but I wanted to highlight a few potential scenarios to consider.
Did some investigation this morning:
- To enable this, we need to implement a scikit-learn-compatible splitter in IbisML. In scikit-learn, the splitter returns indices for each split, and the dataset is materialized during the fit process. Currently, scikit-learn supports pandas, NumPy, polars, and list indexing. I suspect the `_list_indexing` path with `key_dtype == "bool"` could work with Ibis tables, but this requires further investigation.
- Another approach could be enabling scikit-learn to directly support Ibis `Table`.
- Lastly, we should work on making the `get_params` and `set_params` functions more granular.
- Additionally, while it's not an immediate priority, if there's no preprocessing parameter tuning, it might be more efficient to apply splitting on the Ibis-preprocessed dataset in scikit-learn to minimize data movement (see the sketch below).
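On the last point, a hedged sketch of what that could look like today (`train_table` and the "target" column are placeholders, and the exact in-memory type returned by `Recipe.transform` depends on how the output format is configured): fit the recipe once on the backend, materialize the result, and let scikit-learn's ordinary index-based CV handle only the estimator.

```python
import ibis_ml as iml
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Placeholder: an Ibis table `train_table` with a "target" column.
recipe = iml.Recipe(iml.Cast(iml.numeric(), "float32"))

X_expr = train_table.drop("target")
y_expr = train_table["target"]

# Run the feature transformation once on the compute backend, then pull only
# the preprocessed result into local memory for scikit-learn.
X_local = recipe.fit(X_expr, y_expr).transform(X_expr)
y_local = y_expr.to_pandas()

# Grid search now wraps only the estimator, so scikit-learn's usual
# index-based splitting works on the materialized data.
search = GridSearchCV(
    GradientBoostingRegressor(),
    param_grid={"n_estimators": [100, 300]},
    cv=5,
).fit(X_local, y_local)
```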
Some update here:
~~We may not have the bandwidth to tackle this in the next 2 weeks, but~~
it’s something we plan to address soon.
I guess there is also another thing to consider: isn't it the goal of IbisML to not just support scikit-learn but also other frameworks? I am mainly mentioning this because there is a risk of overfitting to sklearn too.
> I guess there is also another thing to consider: isn't it the goal of IbisML to not just support scikit-learn but also other frameworks? I am mainly mentioning this because there is a risk of overfitting to sklearn too.
It's a bit of both. :) As you say, we don't want to (and possibly can't) overfit to scikit-learn. For example, re some of the concerns brought up in https://github.com/ibis-project/ibis-ml/issues/135#issuecomment-2307729193—we probably don't need to overly focus on enabling things like caching. Especially as users look to scale, the right approach with IbisML may involve adding a bit of simple custom code or something. For hyperparameter tuning, from some limited research when the IbisML project was started, it sounded like a lot of the target users in industry were using things like Optuna, so we may focus more on enabling that workflow.
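As a rough picture of that Optuna-style workflow (a sketch only; the synthetic data stands in for features that an IbisML recipe would already have preprocessed and materialized):

```python
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Stand-in for data that an IbisML recipe has already transformed and pulled
# into local memory.
X_local, y_local = make_regression(n_samples=500, n_features=10, random_state=0)


def objective(trial):
    # Only the estimator's hyperparameters are tuned here; the preprocessing
    # stays fixed on the backend side.
    model = GradientBoostingRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 500),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    )
    return cross_val_score(model, X_local, y_local, cv=5).mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
```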
That said, scikit-learn is the most popular way for Python-oriented data scientists to work, and there's an extent of wanting to meet users where they are/fit into existing workflows where it's reasonable.
To that end, I can probably look into at least:
- supporting `get_params`/`set_params` on steps
- doing a bit more digging to figure out how we can enable things like grid search, even if it's not the most efficient (see the sketch below)
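To make these two items concrete, this is the kind of grid one would eventually like to write once Recipe steps expose granular `get_params`/`set_params`; the commented-out `recipe__cast__dtype` name is invented for illustration and is not current IbisML API:

```python
import ibis_ml as iml
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    iml.Recipe(iml.Cast(iml.numeric(), "float32")),
    GradientBoostingRegressor(),
)

# make_pipeline names steps after their lowercased class, so nested parameters
# follow scikit-learn's "<step>__<param>" convention.
param_grid = {
    # Hypothetical recipe-step parameter, pending granular get_params/set_params:
    # "recipe__cast__dtype": ["float32", "float64"],
    "gradientboostingregressor__n_estimators": [100, 300],
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
```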
Let's start with sklearn and consider other frameworks during design and implementation if possible. Thank you.
PyTorch and TensorFlow themselves do not support k-fold CV; some higher-level libraries may do this.