# feat: support more parts of end-to-end ML workflow

## Objectives
- Provide context on the data preprocessing, feature engineering, and model training ML workflow to inform the scope of Ibis-ML
- Propose direction and deliverables for Q1 and Q2 (and roadmap items for further down the road)
## TL;DR
Start at the "Alternatives considered" section.
## Constraints
- Ibis-ML will focus on enabling data processing workloads for ML on tabular data
- Ibis-ML will be a standalone extension lib that depends on Ibis
- Excludes domain-specific preprocessing like NLP, computer vision, and large language models
- Does not address exploratory data analysis (EDA) or model training-related procedures
## Mapping the landscape

Data processing for ML is a broad area. We need a strategy to differentiate our value and narrow the focus to where we can provide immediate value.
### Breaking down an end-to-end ML pipeline
Stephen Oladele’s neptune.ai blog article provides a high-level depiction of a standard ML pipeline.
Source: https://neptune.ai/blog/building-end-to-end-ml-pipeline
The article also describes each step of the pipeline. Based on the previously-established constraints, we will limit ourselves to the data preparation and model training components.
The data preparation (data preprocessing and feature engineering) and model training parts can be further subdivided into a number of processes:
- Feature creation: Creating new features from existing ones or combining different features to create a new one.
- Feature publishing: Pushing to a feature store to be used for training and inference by the entire organization.
- Training dataset generation: Constructing training data by retrieving (if necessary) and joining features.
- Data segregation: Splitting data into training, testing, and validation sets.
- Cross validation: https://scikit-learn.org/stable/modules/cross_validation.html
- Hyperparameter tuning: https://scikit-learn.org/stable/modules/grid_search.html
- Feature preprocessing
  - Feature standardization/normalization: Converting feature values to a similar scale and distribution. Usually falls under model preprocessing.
  - Feature cleaning: Treating missing feature values and removing outliers by capping/flooring them.
- Feature selection: Selecting the most appropriate features to be cleaned and engineered. A number of automated algorithms exist.
> [!NOTE]
> The above list of processes is adapted from the linked article. I've updated some of the definitions based on my experience and understanding.
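To make several of these processes concrete, here is a minimal scikit-learn sketch (on a toy dataset, purely illustrative) covering data segregation, feature standardization, cross validation, and hyperparameter tuning:

```python
# Minimal sketch: data segregation, standardization, CV, and tuning with
# scikit-learn on a toy dataset.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
# Data segregation: hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature standardization + model, tuned with 5-fold CV over a small grid.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```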
### Feature comparison (WIP)
| | Tecton | Scikit-learn | BigQuery ML | NVTabular | Dask-ML | Ray |
|---|---|---|---|---|---|---|
| Feature creation | ✅ | ❌ | ❌ | Partial | ❌ | |
| Feature publishing | ✅ | ❌ | Partial | ❌ | ❌ | |
| Training dataset generation | ✅ | ❌ | ✅ | ❌ | ❌ | |
| Data segregation | ❌ | ✅ | Partial | ❌ | ✅ | |
| Cross validation | ❌ | ✅ | ❌ | ❌ | ✅ | |
| Hyperparameter tuning | ❌ | ✅ | ✅ | ❌ | ✅ | |
| Feature preprocessing | ❌ | ✅ | ✅ | ✅ | ✅ | |
| Feature selection | ❌ | ✅ | ❌ | ❌ | ❌ | |
| Model training | ❌ | ✅ | ✅ | ❌ | ✅ | |
| Feature serving | ✅ | ❌ | Partial | ❌ | ❌ | |
### Details
#### Tecton

- Feature creation: Yes
  - This is one of Tecton’s core value propositions. They support Spark and Rift (proprietary Python-based compute engine) for feature definition. Rift allows a broader range of Python transformations (i.e. not just SQL-like operations, and avoiding UDFs).
- Feature publishing: Yes
  - The other half of Tecton’s core capabilities.
- Training dataset generation: Yes
  - In Tecton, this involves first retrieving published features: https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-training/constructing-training-data
- Data segregation: No
- Cross validation: No
- Hyperparameter tuning: No
- Feature preprocessing: No
  - Together with model development, this is delegated to another library (e.g. scikit-learn).
- Feature selection: No
- Model training: No
- Feature serving: Yes
  - https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-inference
#### Scikit-learn
- Feature creation: No
- Feature publishing: No
- Training dataset generation: No
- Data segregation: Yes
- Cross validation: Yes
- Hyperparameter tuning: Yes
- Feature preprocessing: Yes
- Feature selection: Yes
- Model training: Yes
- Feature serving: No
#### BigQuery ML

- Feature creation: No
  - Just write SQL in BigQuery itself. 🙂
- Feature publishing: Partial
- Training dataset generation: Yes
  - Either pull a BigQuery table or fetch data from Vertex AI Feature Store, depending on whether features are published.
- Data segregation: Partial
  - Pass `DATA_SPLIT_*` parameters to your `CREATE MODEL` statement to control how train-test splitting is done. You can’t extract the split dataset.
- Cross validation: No (automated?)
- Hyperparameter tuning: Yes
  - Pass `HPARAM_*` parameters to your `CREATE MODEL` statement.
- Feature preprocessing: Yes
  - E.g. https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-imputer
- Feature selection: No
  - Does have https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-importance
- Model training: Yes
- Feature serving: Partial
  - See feature publishing
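As an illustration of the `DATA_SPLIT_*`/`HPARAM_*` points above, a hedged sketch of driving BigQuery ML from Python; the dataset, table, and column names are hypothetical, and the option names should be checked against the BigQuery ML docs:

```python
# Sketch: BigQuery ML handles segregation and tuning inside CREATE MODEL.
# Dataset/table/column names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project/credentials
sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned'],
  data_split_method = 'RANDOM',        -- DATA_SPLIT_*: train-test splitting
  data_split_eval_fraction = 0.2,
  num_trials = 10,                     -- enables hyperparameter tuning
  l2_reg = HPARAM_RANGE(0, 10)         -- HPARAM_*-style search space
) AS
SELECT * FROM `my_dataset.training_data`
"""
client.query(sql).result()  # training runs inside BigQuery
```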
#### NVTabular

- Feature creation: Partial
  - `LambdaOp` and `JoinExternal` enable very simple row-level feature engineering
- Feature publishing: No
- Training dataset generation: No
- Data segregation: No
- Cross validation: No
- Hyperparameter tuning: No
- Feature preprocessing: Yes
- Feature selection: No
- Model training: No
- Feature serving: No
#### Dask-ML
- Feature creation: No
- Feature publishing: No
- Training dataset generation: No
- Data segregation: Yes
- Cross validation: Yes
- Hyperparameter tuning: Yes
- Feature preprocessing: Yes
- Feature selection: No
- Model training: Yes
- Feature serving: No
#### Ray
- Feature creation:
- Feature publishing:
- Training dataset generation:
- Data segregation:
- Cross validation:
- Hyperparameter tuning:
- Feature preprocessing:
- Feature selection:
- Model training:
- Feature serving:
## Ibis-ML product hypotheses

### Scope
- A library needs to solve a sufficiently-large problem to be adopted widely. To this end, we want to provide value in multiple stages of the ML pipeline.
- (Domain-driven) feature engineering is already handled sufficiently well by Ibis. As with other tools that are part of an ecosystem that already supports data transformation (e.g. BigQuery, Dask), we leave feature engineering to the existing tooling (i.e. Ibis).
- Feature publishing, retrieval, and serving are orthogonal and can be left to feature platforms.
- Model training can’t be done by a database unless it controls the underlying (cloud) infrastructure and can treat training as another distributed compute problem (e.g. BigQuery ML, Snowpark ML). Ibis doesn’t control the underlying compute infrastructure.
- In practice, hyperparameter tuning in industry/large companies is often delegated to purpose-fit tools like Optuna.
- The remainder is (potentially) in scope.
  - Data segregation and cross validation are required in almost every tabular ML problem.
    - @jcrist: "Re: test/train splits and CV things - I do think we can provide some utilities in the initial release for handling test/train splitting or CV work, but for now I suggest we mostly focus on single model pipelines and already partitioned data. We ingest an `ibis.Table` as training data, we don't need to care for now where it's coming from or how the process upstream was handled IMO."
  - Feature preprocessing is a good fit for Ibis to provide value. Technically, a user with a model pipeline (e.g. scikit-learn) may already include preprocessing in their pipeline, so they may or may not leverage this.
  - (Automated) feature selection is more case-by-case, and therefore a lower priority.
## Alternatives considered

End-to-end IMO also means that you should be able to go beyond just preprocessing the data. There are a few different approaches here:
1. Ibis-ML supports fitting data preprocessing steps (during the training process) and applying pre-trained Ibis-ML preprocessing steps (during inference).
   - Pros: Ibis-ML is used during both the training and inference process.
   - Cons: Ibis-ML only supports data preprocessing, and even then only the subset of steps that can be fit in-database (e.g. not some very widely-used steps, like PCA, that sit in the middle of the data-preprocessing pipeline).
2. Ibis-ML supports constructing transformers from a wider range of pre-trained preprocessors and models (from other libraries, like scikit-learn), and applying them across backends (during inference).
   - Pros: Ibis-ML gives users the ability to apply a much wider range of steps in the ML process during inference time, including preprocessing steps that can be fit linearly (e.g. PCA) and even linear models (e.g. SGDRegressor, GLMClassifier). You can even showcase the end-to-end capabilities just using Ibis (from raw data to model outputs, all on your database, across streaming and batch, powered by Ibis).
   - Cons: Ibis-ML doesn't support training the preprocessors on multiple backends; the expectation is that you use a dedicated library/existing local tools for training.
3. A combination of 1 & 2, where Ibis-ML supports a wider range of preprocessing steps and models, but only a subset support a fit method (those that don't need to be constructed `.from_sklearn()` or something); see the sketch after this list.
   - Pros: Support the wider range of operations, and also fitting everything on the database in simple cases.
   - Cons: ~~Confusing? If I can train some of my steps using Ibis-ML, but for the rest I have to go to a different library, it doesn't feel very unified.~~ @jcrist makes a good point that it's not so confusing, because of the separation of transformers and steps.
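A hedged sketch of what option 3 could look like in code. The `Recipe`/step/selector names follow the existing Ibis-ML structure, while `from_sklearn` is the proposed (not yet existing) constructor:

```python
# Option 3 sketch: fit simple steps in-database; construct the rest from a
# locally-fitted sklearn estimator. from_sklearn is proposed, not implemented.
import ibis
import ibis_ml as ml
from sklearn.decomposition import PCA

train = ibis.read_csv("train.csv")  # hypothetical training table

# Steps Ibis-ML can fit in-database, via the existing Recipe structure.
rec = ml.Recipe(
    ml.ImputeMean(ml.numeric()),
    ml.ScaleStandard(ml.numeric()),
)

# A step that can't be fit in-database: fit it locally, then wrap it as a
# transform that compiles to backend expressions at inference time.
pca = PCA(n_components=2).fit(train.to_pandas())
pca_transform = ml.from_sklearn(pca)  # proposed API
```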
## Proposal

I propose to go with option #3 of the alternatives considered. In practice, this will mean:
- Keeping the existing structure of Ibis-ML
- Adding the ability to construct transforms `from_sklearn` (and, in the future, potentially from other libraries)
  - Some of the transforms may not be steps you can fit using Ibis-ML
This also means that the following will be out of scope (at least, for now):
- Train-test split (may have value to add in the future)
- CV (may have value to add in the future)
- Hyperparameter tuning (less hypothesized value; probably better to integrate with existing frameworks like Optuna)
## Deliverables

### Guiding principles
- At launch, we should showcase an end-to-end Ibis-ML workflow, from preprocessing to model inference.
- The goal is to get people excited about Ibis-ML, and for them to try the example(s).
- In future releases, we will increase the number of methods we support for each step. If we are successful in the first sub-goal (about generating excitement), people in the community will provide direction for and even contribute to this effort.
- The library should be incrementally adoptable. The user should get value out of using just Ibis-ML data segregation or feature preprocessing, and then be able to add another piece and get further value.
### Demo workflows

1. Fit preprocessing on DuckDB (local experience, during experimentation)
   - Experiment with different features
2. Fit finalized preprocessing on larger dataset (e.g. from BigQuery)
3. Perform inference on larger dataset
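A hedged sketch of this flow; the connections, file/table names, and exact Recipe/step method names are illustrative:

```python
# Demo workflow sketch: fit locally on DuckDB, refit and run on BigQuery.
import ibis
import ibis_ml as ml

# 1. Local experimentation: fit preprocessing on a DuckDB sample.
local = ibis.duckdb.connect()
sample = local.read_parquet("sample.parquet")  # hypothetical sample file
rec = ml.Recipe(ml.ImputeMean(ml.numeric()), ml.ScaleStandard(ml.numeric()))
rec.fit(sample)

# 2. Fit the finalized preprocessing on the full dataset in BigQuery.
bq = ibis.bigquery.connect(project_id="my-project", dataset_id="my_dataset")
rec.fit(bq.table("training_data"))  # hypothetical table

# 3. Inference on the larger dataset: the fitted recipe produces backend
#    expressions, so the preprocessing work stays in the warehouse.
preprocessed = rec.transform(bq.table("inference_data"))
```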
We are currently targeting the NVTabular demo on the RecSys2020 Challenge as a demo workflow.
We need variants for all of:
- scikit-learn (we already have)
- XGBoost
- PyTorch
With less priority:
- LightGBM
- TensorFlow
- CatBoost
### High-level deliverables
P0 deliverables must be included in the Q1 release. The remainder are prioritized opportunistically/for future development, but priorities may shift (e.g. due to user feedback).
- [x] ~~[P0] Support handoff to XGBoost (for training and inference)~~ Update: `to_dmatrix`/`to_dask_dmatrix` are already implemented (see the sketch after this list)
- [ ] [P0] Support handoff to PyTorch (for training and inference)
- [ ] [P0] Build demo workflows
- [ ] [P0] Make documentation ready for "initial" release
- [ ] [P0] Increase coverage of Ibis-ML preprocessing steps w.r.t. tidymodels
- [ ] [P1] Increase coverage of data processing transformer(s) `from_sklearn`
- [ ] [P2] Increase coverage of model prediction transformer(s) `from_sklearn` (i.e. those with predict functions that don't require UDFs)
- [ ] [P2] Support handoff to LightGBM (for training and inference)
- [ ] [P2] Support handoff to TensorFlow (for training and inference)
- [ ] [P2] Support handoff to CatBoost (for training and inference)
- [ ] [P2] Support (demo?) inference in streaming contexts
- [ ] [P3] Support constructing some data preprocessing transformer(s) `from_sklearn` (e.g. PCA, or some more frequently used step)
- [ ] [P3] Support constructing some (linear) model prediction transformer(s) `from_sklearn` (e.g. SGDRegressor)
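For context on the XGBoost handoff item above: at its simplest, the handoff amounts to materializing the preprocessed table and constructing a DMatrix. A minimal sketch using only stock xgboost APIs, with a hypothetical table path and label column (Ibis-ML's `to_dmatrix` covers this handoff):

```python
# Sketch of the XGBoost handoff via an explicit pandas round-trip.
# Table path and "label" column are hypothetical.
import ibis
import xgboost as xgb

table = ibis.read_parquet("train.parquet")  # preprocessed training table
df = table.to_pandas()
dtrain = xgb.DMatrix(df.drop(columns=["label"]), label=df["label"])
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=50)
```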
## Questions for validation

- Does being able to perform inference for certain model classes directly on the database, without UDFs, provide real value? Are models with a linear predict function too narrow a category for people to care?
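For context on this question: a fitted linear model's predict is just a weighted sum of columns plus an intercept, which Ibis can express and compile to SQL directly, with no UDF. A sketch with illustrative column names and weights:

```python
# Linear predict as a plain SQL expression: weighted sum + intercept.
# Column names and weights are illustrative (e.g. from SGDRegressor.coef_).
import ibis

t = ibis.table({"x1": "float64", "x2": "float64"}, name="features")
coef = {"x1": 0.5, "x2": -1.2}
intercept = 0.3

pred = sum(t[c] * w for c, w in coef.items()) + intercept
print(ibis.to_sql(t.mutate(prediction=pred), dialect="duckdb"))
```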
## Changelog

### 2024-03-19

Based on discussion with stakeholders around the Ibis-ML use cases and vision, some of the priorities have shifted:
- Ibis-ML should leverage the underlying engine during both training and inference, and speeding up the training iteration loop on big data is a key value proposition. Therefore, support for constructing transformers `from_sklearn` is no longer a priority (demoted from P0 to P3).
  - The associated demo of scaling inference only is also removed.
- Increasing coverage of ML preprocessing steps w.r.t. tidymodels recipes and `sklearn.preprocessing` is a higher priority. We break down the relative priority of implementing steps in a separate issue.
Thanks @deepyaman for putting this together.

> Does not address exploratory data analysis (EDA) or model training-related procedures

Allowing users to complete all data-related tasks before model training, without switching to other tools, would be highly beneficial. Since users need to understand the data thoroughly before selecting suitable features and preprocessing strategies, integrating EDA (univariate analysis, correlation analysis, and feature importances) into the feature engineering phase is important: it equips users to make informed decisions during the feature selection and preprocessing stages.
I agree that it could be valuable to handle more where Ibis is well-suited (e.g. some EDA). Your open issue on the ibis repo is very relevant. W.r.t. model training, that ultimately would need to be handled by other libraries, but we should make sure that the handoffs are smooth and efficient.
Feature engineering is a much bigger topic; I could see Ibis-ML expanding in that direction, to include some auto-FE (a la Featuretools), but it's not clear whether that's a priority. It's also a bit separate from the initial focus.
For consideration, from @jcrist just now: consider something like `transform_sklearn(est, table) -> table` over `from_sklearn(est) -> some_new_type`, to avoid naming/designing the `some_new_type` object.
- @deepyaman: The `some_new_type` could just be a transform (or step, post-refactor?); check which option will be easier.
IbisML 0.1.0 is released and covers most of this.