
[Proposal] MLflow Pipelines (previously named MLX)

mengxr opened this issue 2 years ago • 18 comments

Proposal Summary

(You can find the latest version of the proposal in this Google doc.)

What are you trying to do? Articulate your objectives using absolutely no jargon.

We want to introduce MLflow Pipelines (previously named MLX), an opinionated approach to MLOps, as part of MLflow’s next major release. It adds two key features to MLflow:

  • Predefined ML pipelines to address common ML problems at production quality.
  • Utilization of efficient pipeline execution engines to accelerate ML development.

Our objective is to enable data scientists to stay mostly within their comfort zone, applying their expert knowledge, while still following best practices in ML development and delivering production-ready ML projects with little help from production engineers and DevOps.

Watch a proof-of-concept video here:

PoC

How is it done today, and what are the limits of current practice?

Almost every company wants to be a Data+AI company. But many are new to ML in production settings, where the core ML code is a relatively small component surrounded by many others: data collection and verification, testing and debugging, automation via CI/CD, resource provisioning, model management and monitoring, etc. This is where MLOps comes in; its job is to connect the various components needed to productionize ML.

However, MLOps is hard, and despite growing interest there are no widely established approaches to it. In a Databricks blog post, Need for Data-centric ML Platforms, the authors summarized MLOps as ModelOps + DataOps + DevOps. Connecting components from these different areas naturally makes MLOps hard. What makes the job even harder is that in many companies the ownership of MLOps falls through the cracks between data science teams and production engineering teams. Data scientists are mostly focused on modeling the business problems and reasoning about data, features, and metrics, while production engineers/ops are mostly focused on traditional DevOps for software development, overlooking ML-specific operations like ML development cycles, experiment tracking, data/model validation, etc.

One solution to this problem is to hire “full-stack” data scientists or engineers who are capable of doing everything end-to-end. But only a few companies can afford this luxury solution.

Another candidate solution is to adopt MLOps frameworks open-sourced by companies with mature MLOps practices, for example TFX from Google. However, those frameworks were originally designed for internal users solving highly complex problems. They are not simple enough for the majority of data scientists, who need to solve less sophisticated ML problems.

With ~10 million monthly downloads, MLflow is the most popular open-source MLOps framework. It is widely used to track the ML lifecycle and manage projects, metadata, and models. However, it is still missing the “flow” part that streamlines an end-to-end pipeline. This proposal aims to fill that gap and greatly simplify the end-to-end story.

What is new in your approach and why do you think it will be successful?

We propose an opinionated approach to simplify and standardize ML development and productionization. On one hand, we offer data scientists production-quality ML pipeline templates they can start with and iterate quickly. On the other hand, we offer production engineers command-line interfaces for easy CI/CD integration. We focus on simplicity and try to address less sophisticated ML use cases, which we believe are the majority.

Below are the key differentiators:

  • Pre-defined production-quality ML pipeline templates. We will build pipeline templates that match common ML problem types (e.g., regression, clustering, recommendation) and embed best practices collected from industry experts. Instead of constructing an ML pipeline end-to-end, data scientists first pick the pipeline template that matches their problem type to start a project and then customize its steps to solve the problem. This declarative, config-driven approach saves boilerplate code, so they can focus on the modeling steps while still delivering production-ready projects. We believe the pre-defined templates will cover a large share of common ML problems.

  • Efficient pipeline execution engine. ML development is a highly iterative process: users frequently jump between steps in a pipeline to understand the problem and improve the model. We will adopt an efficient pipeline engine to optimize execution. After making changes, users only declare what they want and leave it to the engine to figure out which steps need to be re-executed and which results can be reused, so they can iterate quickly during development, e.g., changing some features and then directly verifying feature importance.

  • Command-line interfaces to execute ML pipelines. During production handoff, instead of letting data scientists and production engineers negotiate the scripts, params, and I/O to be used in CI/CD, we want to standardize the interfaces used to train and deploy models. The pre-defined pipelines automatically perform checks like schema and model validation, so production engineers have fewer things to worry about if they adopt our approach.

  • Notebook interfaces to execute ML pipelines. While we promote modular development for production, we love notebooks for data analysis, which is essential during iterative ML development. MLflow Pipelines provides notebook interfaces to trigger pipeline execution and display relevant results in cell outputs for data analysis and comparison (a rough sketch follows this list).
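To make this concrete, here is a rough, purely illustrative sketch of the notebook interface described above. The import path, class, and method names are assumptions for illustration, not a committed API:

```python
# Illustrative sketch only -- the actual MLflow Pipelines API may differ.
# A data scientist triggers individual steps from a notebook and inspects
# their outputs inline.
from mlflow.pipelines import Pipeline  # hypothetical import path

# Load the project's pipeline definition (e.g., pipeline.yaml at the repo root)
# with a named configuration profile.
pipeline = Pipeline(profile="local")

# Run a single step; up-to-date upstream steps would be reused, not recomputed.
pipeline.run(step="evaluate")

# Display a step's output, e.g., evaluation metrics, in the notebook cell output.
print(pipeline.get_artifact("metrics"))
```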

Who cares? If you are successful, what difference will it make?

  • Data scientists. If MLflow Pipelines is successful, data scientists can stay mostly in their comfort zone, applying their expert knowledge, while delivering end-to-end production-quality projects.
  • Production engineers. If MLflow Pipelines is successful, production engineers can easily configure CI/CD for ML projects and let data scientists fully own the development cycles.

What are the risks?

MLflow Pipelines is an opinionated approach. The biggest risk is whether the proposed opinionated piece fits the target use cases well. To de-risk, we plan to build the first pipeline template as a proof-of-concept, make it an optional component in MLflow, and collect feedback from the MLflow community to see how it fits and decide the next step.

How much will it cost?

The initial work will be focused on creating the first pipeline template, e.g., regression, and providing both command-line and notebook interfaces. We will adopt an existing pipeline engine instead of building one to save cost. Our estimate is ~10 person-weeks (PWs).

How long will it take?

We plan to release the first pipeline template in early May. Contributions from the community would help accelerate the development.

What are the mid-term and final “exams” to check for success?

  • Mid-term:
    • We receive positive feedback from the community on the first pipeline template.
    • MLflow Pipelines becomes a default component in MLflow’s next major release and ships with more production-quality pipeline templates.
  • Final:
    • Wide adoption of MLflow Pipelines.

Appendix

Terminology

  • Pipeline: An orchestration to solve one kind of machine learning problem end-to-end. It consists of steps, their inputs and outputs, and how they depend on each other. It is usually pre-defined by engineers and used by data scientists, who can configure the pipeline and its steps and customize certain steps to fit the specific problem to solve.
  • Step: An MLflow pipeline building block that usually does a single task, e.g., feature transformation. It declares inputs and outputs to chain with other steps in a pipeline. Users can configure its behavior via configuration files or Python code. Once configured, a step should be deterministic. Step names are verbs.
  • Run: A session that tracks the execution of an MLflow pipeline, fully or partially.
  • Profile: A named set of configurations users can activate when triggering a pipeline. Common profile names are “local”, “dev”, and “prod”.
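To make the terminology concrete, here is a minimal, purely illustrative sketch of these concepts in Python. It is not an actual MLflow API, just a mental model:

```python
# Toy model of the terms above (Pipeline, Step, Profile); not MLflow code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """A single, deterministic unit of work; names are verbs, e.g., 'ingest', 'train'."""
    name: str
    run: Callable[[dict], dict]            # maps declared inputs to declared outputs
    config: Dict[str, object] = field(default_factory=dict)


@dataclass
class Pipeline:
    """An ordered chain of steps solving one kind of ML problem end-to-end."""
    steps: List[Step]
    profile: str = "local"                 # e.g., "local", "dev", "prod"

    def run(self) -> dict:
        outputs: dict = {}
        for step in self.steps:
            outputs.update(step.run({**outputs, **step.config}))
        return outputs
```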

Example development workflow

Mike is a data scientist. He liked the MLflow Pipelines tutorial and wanted to try MLflow Pipelines on a new ML project.

  • He already installed MLflow w/ MLflow Pipelines from the tutorial.
  • His new project is to predict used car sale prices. It is a regression problem. So he ran “mlp init --name autos --pipeline regression” to create a new project folder.
  • He used VS Code to open the generated project folder.
  • Inside the project folder, he saw a README.md file, a configuration file “pipeline.yaml”, a dependency file “requirements.txt”, and subfolders “steps/”, “notebooks/”, etc.
  • He opened README.md first and saw instructions and a list of TODOs:
    • [ ] Check/update “requirements.txt” and run “pip install -r requirements.txt”.
    • [ ] Open “notebooks/runme.ipynb” and try running this project first.
    • [ ] Update the primary evaluation metric to use.
    • [ ] Update data location in “pipeline.yaml” and target column.
    • [ ] Update sklearn estimator in “steps/train.py”.
    • [ ] …
  • He took a look at requirements.txt. The packages he needed were all listed. So he ran “pip install -r requirements.txt” directly.
  • He opened “notebooks/runme.ipynb” in VS Code. He had already installed the Jupyter plugin. He activated the Python environment (kernel) for this project.
  • He clicked “Run All” and saw the pipeline visualization and outputs from its steps.
  • He checked the primary metric defined in “pipeline.yaml”. It is RMSE, which is good.
  • He checked the data path, which points to a local file under “data/”. He moved a parquet file that contains the sample training dataset to “data/” and updated the data path.
  • He changed the target column to “price”.
  • He switched back to “runme.ipynb” and re-ran the “ingest” cell, which displays the data summary of ingested data.
  • Then he re-ran the “evaluate” cell to see model performance on the real training dataset. The RMSE was more than $10,000, which is bad. He also saw the training examples with the worst prediction errors.
  • He opened “steps/train.py” and saw it uses an AutoML library to train. It takes a param to limit the cost. He increased the limit in the conf and hoped it would improve the model.
  • He switched back to “runme.ipynb”. He created a new cell and triggered the “evaluate” step again. He saw that the new model got RMSE $2,000, which is much better.
  • He re-ran the “explain” cell to see feature importance. One feature didn’t show up among the top ones. He knew the feature was very important, but it needed some parsing.
  • So he opened “steps/transform.py” and implemented a parser for that feature (a sketch of such a parser follows this walkthrough).
  • He found himself switching between the notebook and the source code. So he split the VS Code window and put the notebook and source code windows side by side.
  • He created a new cell and re-ran “evaluate”. The RMSE improved to $1,500. He checked feature importance again and confirmed that the specific feature was now among the top ones.
  • After a few more iterations, he successfully improved the RMSE to $1,000 on the sample training dataset.
  • He wanted to test it on the full training dataset.
  • He updated the “pipeline.yaml” file and added a new profile called “dev”. He configured the “ingest” step to read from the full dataset and increased the train cost limit again.
  • He updated the profile name and reran the notebook. It took much longer. He found that the RMSE on the full dataset was $800, which was good enough to ship as the initial version.
  • He used git to check in the project folder to a repo shared by the data science team.
  • Btw, all the trials and models Mike made were tracked in MLflow automatically :)
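For illustration only, here is a hedged sketch of what Mike’s feature parser in “steps/transform.py” might look like. The hook name (transformer_fn) and the column names are assumptions for this walkthrough, not the final template contract:

```python
# steps/transform.py -- illustrative sketch; the real template contract may differ.
# Parses a raw text column (e.g., engine displacement such as "2.0 L") into a
# number so the model can actually use the feature, as Mike does above.
import re

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def _parse_displacement(value: str) -> float:
    """Extract a numeric engine size from strings such as '2.0 L' or '1998 cc'."""
    match = re.search(r"(\d+(\.\d+)?)", str(value))
    return float(match.group(1)) if match else float("nan")


def transformer_fn():
    """Return an sklearn-compatible transformer used by the 'transform' step."""
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        df["engine_size"] = df["engine"].map(_parse_displacement)
        return df

    return FunctionTransformer(transform)
```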

mengxr avatar Mar 04 '22 21:03 mengxr

This is a very interesting proposal. A similar tool is offered in Visual Studio for .NET developers: https://docs.microsoft.com/en-us/dotnet/machine-learning/automate-training-with-model-builder. There are some differences, but I believe it is worth looking into.

I'm interested in helping, such as doing code reviews.

One question is which pipeline execution engine would you use?

sonichi avatar Mar 04 '22 23:03 sonichi

@sonichi Thanks for the references! I will take a look at ML.NET.

On the pipeline execution engine, I used bazel during prototyping. The pipeline engine won't be end-user facing (except installing some dependencies if manual installation is required). So the actual choice is not important in this phase as long as it meets the functional requirements.
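For intuition on why a build tool fits: the engine only needs to re-execute a step when its inputs, configuration, or code change, and can otherwise reuse cached results. A toy sketch of that idea (not MLflow code, and not tied to any particular engine):

```python
# Toy content-addressed step cache: re-run a step only if its inputs or config
# changed since the last run. Real engines (Bazel, make, etc.) do this with much
# more care, e.g., also hashing the step's code and declared dependencies.
import hashlib
import json
import pickle
from pathlib import Path

CACHE = Path(".step_cache")
CACHE.mkdir(exist_ok=True)


def run_step(name, fn, inputs, config):
    """Execute fn(inputs, config) unless an identical run is already cached."""
    key = hashlib.sha256(
        json.dumps({"step": name, "inputs": inputs, "config": config},
                   sort_keys=True, default=str).encode()
    ).hexdigest()
    cached = CACHE / f"{name}-{key}.pkl"
    if cached.exists():
        return pickle.loads(cached.read_bytes())  # reuse the previous result
    result = fn(inputs, config)
    cached.write_bytes(pickle.dumps(result))
    return result
```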

mengxr avatar Mar 07 '22 16:03 mengxr

Overall, I strongly support some kind of pipeline execution native to MLflow. I think that would make it a much more complete platform for MLops. For some background on my questions, I'm a ML engineer with 3.5 years experience focused in deep learning computer vision. Some of my favorite tools are PyTorch Lightning, MLflow 😄, Raytune, Dolt, and Hydra. To start, I have a couple generic questions

  1. To me, MLX sounds similar to Kubeflow, so I'm curious how you would distinguish between the two, and what would convince a team to switch from Kubeflow given that it's already fairly mature in comparison? An obvious difference is that Kubeflow is native to Kubernetes for its execution engine / orchestration framework vs. MLX sounds like it would be more agnostic in that sense, but I'm curious about other considerations
  2. Related to the above, how will infrastructure be provisioned/deployed with MLX? Considering cloud, on-prem, containerization, distributed workloads, existing orchestration frameworks, GPUs, etc

Reading your proposal, I'd be interested in exploring flexibility for user-defined

  1. Pipeline templates: having some pre-made starting points is good, but I think many users will quickly find themselves needing to customize aspects beyond what can be accomplished with a config file. If there wasn't an option to subclass/extend the base templates, I worry that MLX wouldn't be used widely enough to gain traction in the community
  2. Execution engines: I see you mentioned Bazel above, which might be okay for an initial release. However, could the pipeline engine interface to MLX be designed so that it's adaptable for other framework integrations in the future? I've heard Airflow, Pachyderm, and other tools mentioned more often in the MLops community than Bazel, and it would be great to write integrations for future (unknown) tools as they come out since the landscape is evolving so rapidly

I'd also like to give a shout-out to the MLops.community Slack workspace (I have no affiliation, just a community member who thinks it's awesome). Currently there are over 8.7k members in there discussing best practices for MLops under a wide variety of use cases. I've shared this issue in their #mlops-questions-answered channel to promote further engagement, but I think the broader chat history could also be useful for MLflow contributors who want to read-up on additional opinions for brainstorming

addisonklinke avatar Mar 07 '22 17:03 addisonklinke

@addisonklinke MLX and Kubeflow serve different personas. The initial target for MLX is data scientists who are new to ML under production settings, which we believe are the majority.

IMHO Kubeflow is designed for engineers/ops. Take the MNIST example from kubeflow. The very first code cell would scare away many data scientists.

MLX proposes a two-layer solution:

  1. ML pipeline templates defined by engineers, which provide main skeletons to solve certain kinds of problems with guardrails during development and deployments. This could include a deployer to Kubeflow.
  2. Problem-specific conf/code defined by data scientists (mostly within their comfort zone), to customize the project to fit concrete problems.

In the current proposal, we host a few pipeline templates in MLflow directly. I hope it would cover many ML problems already. But as you mentioned, it won't help solve all problems. For advanced problems, later we can open up pipeline/step APIs for engineers to create/extend/customize pipeline templates for data scientists to use.

Initially, the pipeline engine choice is an implementation detail. After we validate that the main idea of MLX has a good product fit, future development can make pipelines compile down to different pipeline execution engines.

Thank you for sharing this proposal with MLops.community! I will watch discussion there as well.

mengxr avatar Mar 07 '22 18:03 mengxr

It would be great if the structure were abstract enough to support custom runners, so the engine can compute what needs to run, but the experiments themselves are submitted to arbitrary infrastructure. There could be support for Kubernetes out of the box, and the user could write a custom runner to run experiments on whatever infrastructure they have.

giacomov avatar Mar 07 '22 19:03 giacomov

I like this direction, but would really want to see MLflow utilize or integrate with existing pipeline tools like Argo Workflows or Apache Airflow. Maybe take a look at metaflow and how they are enabling step-like DAGs for inspiration?

zbloss avatar Mar 16 '22 14:03 zbloss

@zbloss Agreed, I think there's already a plethora of pipeline tools out there (Pachyderm, ClearML, MLRun, ZenML, Flyte, Kedro... the list goes on) so it would be nice to start coalescing around some industry standards

addisonklinke avatar Mar 16 '22 16:03 addisonklinke

Are there any code examples? For notebooks (Jupyter and Databricks) and non-notebook IDE-based code.

amesar avatar Mar 30 '22 05:03 amesar

@mengxr I am also wondering how this would be integrated with other pipelines such as Airflow, since part of our pipelines use Scala-based Spark jobs to do the heavy-lifting transformation and feature engineering. When data is large, we also use Spark to calculate the statistics comparison, like you demo with Facets-Overview; in that case, we use https://github.com/gopro/facets-overview-spark. How do we transition from an MLflow pipeline to an Airflow pipeline, and from Python to Scala, unless they all use the same pipeline execution engine?

chesterxgchen avatar Mar 31 '22 00:03 chesterxgchen

I broadly enjoy the idea, and this is largely an MLflow-native version of any other pipeline library that exists (i.e. most of these have converged to the same broad architecture of having a pipeline be a collection of steps with separate runs, etc).

But I think the interface of YAML is probably not the best idea. If your target user is truly a data scientist, context switching between YAML and python could present challenges.

I'd recommend taking some inspiration from e.g. Metaflow. I understand that YAML forces some constraints that are helpful during deployment, but you can still maintain such constraints with a pythonic API, which I think will come much more naturally to data scientists.

I have a lot of interest in this project -- would love to chat more about it.

skylarbpayne avatar Apr 05 '22 17:04 skylarbpayne

Are there any code examples? For notebooks (Jupyter and Databricks) and non-notebook IDE-based code.

Do you mean working code examples or mocked? In the attached video, I did a quick walkthrough of the designed workflow. But the code is mainly for demo purposes, not ready for sharing :(

mengxr avatar Apr 07 '22 17:04 mengxr

@mengxr I am also wondering how this would be integrated with other pipelines such as Airflow, since part of our pipelines use Scala-based Spark jobs to do the heavy-lifting transformation and feature engineering. When data is large, we also use Spark to calculate the statistics comparison, like you demo with Facets-Overview; in that case, we use https://github.com/gopro/facets-overview-spark. How do we transition from an MLflow pipeline to an Airflow pipeline, and from Python to Scala, unless they all use the same pipeline execution engine?

  • We provide abstractions at the ML pipeline level, so later it would be possible to compile a pipeline down to an Airflow DAG and execute it on Airflow, although there is no ETA at this stage.
  • We plan to focus on Python only and use PySpark to handle distributed data transformation/aggregation (a rough sketch follows below). I think it is unlikely we will support Scala, given that most of the ML ecosystem is built on top of Python.
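For illustration, a hedged sketch of how an ingest/transform step could use PySpark for the distributed work described above; the file path and column names are made up for this example:

```python
# Illustrative only: a step using PySpark for distributed transformation and
# summary statistics (the kind of heavy lifting mentioned above).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-and-summarize").getOrCreate()

# Distributed transformation: read the raw data and derive features at scale.
df = spark.read.parquet("data/autos.parquet")
df = df.withColumn("age", F.lit(2022) - F.col("year"))

# Summary statistics that can be compared across datasets (e.g., sample vs. full).
df.select("price", "age").describe().show()
```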

mengxr avatar Apr 07 '22 17:04 mengxr

I broadly enjoy the idea, and this is largely an MLflow-native version of any other pipeline library that exists (i.e. most of these have converged to the same broad architecture of having a pipeline be a collection of steps with separate runs, etc).

The design of MLX differs from other pipeline frameworks in that we propose a two-layer solution. One layer is the pre-defined pipeline templates created by engineers, similar to the "compile -> test -> package -> publish" workflow in software development. The second layer is for data scientists to customize the templates to fit their own projects. Other pipeline frameworks like Metaflow are less opinionated about how pipelines should look.

But I think the interface of YAML is probably not the best idea. If your target user is truly a data scientist, context switching between YAML and python could present challenges.

We plan to use YAML in the first version but later allow users to provide a Python script that generates the pipeline YAML. YAML is good for code review, where reviewers can easily see the diffs.
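For example, such a generator script could be as simple as the sketch below; the YAML keys shown here are hypothetical, not a committed schema:

```python
# Illustrative sketch of a Python script that emits pipeline.yaml.
# The schema (keys like "template", "profiles") is made up for this example.
import yaml  # PyYAML

pipeline = {
    "template": "regression/v1",
    "target_col": "price",
    "primary_metric": "rmse",
    "profiles": {
        "local": {"ingest": {"path": "data/sample.parquet"}},
        "dev": {"ingest": {"path": "s3://my-bucket/autos/full/"}},
    },
}

with open("pipeline.yaml", "w") as f:
    yaml.safe_dump(pipeline, f, sort_keys=False)
```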

mengxr avatar Apr 07 '22 18:04 mengxr

I like this direction, but would really want to see MLflow utilize or integrate with existing pipeline tools like Argo Workflows or Apache Airflow. Maybe take a look at metaflow and how they are enabling step-like DAGs for inspiration?

@giacomov @zbloss @addisonklinke We don't plan to implement a pipeline execution engine. In the first stage, we will adopt an existing one. For example, the code in the video demo used Bazel; make should work well too. Essentially what we need is a build tool. In the future, we will support multiple pipeline execution engines like Airflow, Kubeflow, etc.

mengxr avatar Apr 07 '22 18:04 mengxr

This is a great direction for mlflow. The execution of pipelines is an essential part of ml operations, therefore, it makes sense to have pipelines managed in mlflow itself. I have a couple of questions/comments in this regard

  1. Where will the pipeline templates be stored? It might make sense to store the templates with the mlflow service itself for management and sharing of the templates. These templates will be as important an asset for an organization as the models themselves.
  2. As pointed out earlier, users will need the ability to create and design templates, because every data science project has its own unique requirements even if the basic algorithms are similar. I will be happy to collaborate on this effort. Once the template definition format is decided, it may be possible to work on the template design feature in parallel with template integration in notebook workflows.

jnp avatar Apr 13 '22 07:04 jnp

@jnp Thanks for your feedback!

  1. Where will the pipeline templates be stored? It might make sense to store the templates with the mlflow service itself for management and sharing of the templates. These templates will be as important an asset for an organization as the models themselves.

We plan to store the pipeline templates under the mlflow GitHub org, one repo per template, similar to GitHub Actions. After we open up APIs for pipelines and steps, third parties can define their own templates in their own public/private repos or inline them inside a project.

  1. As pointed out earlier, users will need the ability to create and design templates, because every data science project has its own unique requirements even if the basic algorithms are similar.

We initially target less sophisticated ML problems. We do expect the official templates to cover roughly 80% of use cases; they won't fit advanced problems. We can open up APIs to define steps and pipelines so advanced users can create their own templates. We still expect those templates to be created by ML engineers within each organization and provided to data scientists.

mengxr avatar May 03 '22 16:05 mengxr

Hey folks!

We are looking for early users who can test MLflow Pipelines and/or help us extend it. If you are interested, please use this form to sign up.

Thank you for helping us improve MLflow!

ahdbilal avatar May 17 '22 15:05 ahdbilal

A question in this regard: where is the training data stored as we move from step to step (regression template)? Is it one Spark DataFrame (created by the data step) that is accessed by the sequence of steps? What kinds of parameters can one step pass to another step? Are steps reusable as well?

indranilr avatar Jul 03 '22 23:07 indranilr