
KEP-897: Propose centralized experiment tracking in Kubeflow

Open mprahl opened this issue 4 months ago • 11 comments

GitHub issue: #897

This proposal aims to resolve the current fragmented and limited experiment tracking experience by expanding the Kubeflow Model Registry into a unified, centralized metadata store. Currently, experiment tracking is scattered across components like Kubeflow Pipelines (which requires pipeline execution for tracking) and Katib (limited to hyperparameter tuning). This leads to challenges such as limited flexibility for direct logging from Python scripts or Jupyter notebooks, a fragmented user experience across multiple interfaces, and maintenance difficulties due to reliance on the inactive MLMD project.

mprahl, Aug 01 '25 19:08

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot], Aug 01 '25 19:08

Thank you for driving this, @mprahl! Could you please create a tracking issue under kubeflow/community so you can get the KEP number?

It would also be good to mention the history, as I noted here: https://github.com/kubeflow/community/issues/783

As previously discussed in:

  • https://github.com/kubeflow/community/issues/783
  • https://github.com/kubeflow/community/issues/238

andreyvelich, Aug 08 '25 18:08

I'm closing this KEP because my team no longer has capacity to take this on. If others want to pursue this, feel free to fork the KEP and I'll be happy to review and advise. :smile:

mprahl, Sep 05 '25 21:09

@mprahl may we keep it open for now? Just to have it tracked.

The stalebot will close it anyway if there is no activity on this topic.

juliusvonkohout, Sep 26 '25 13:09

I agree with @juliusvonkohout!

Maybe we should put out a call for contributors to help us add Experiment Tracking support via MLflow for Kubeflow sub-projects. This feels like a really important capability that many of our users are asking for, and moving it forward would have a big impact on usability and Kubeflow adoption.

cc @kubeflow/wg-training-leads @kubeflow/wg-pipeline-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-outreach-committee @jbottum

andreyvelich, Sep 26 '25 14:09

Rather than tying it strictly to MLflow as the implementation choice, I believe it would be very helpful to add an SPI (strongly inspired by MLflow's Experiment/Run model to begin with) so that other integrations in this area can be added later.

This is not to dispute MLflow's popularity, but other community discussions show that alternatives also have market share. An SPI would prepare the ground for additional contributors as well, building on what Andrey just said.

What would be the @kubeflow/kubeflow-steering-committee POV on this?

tarilabs, Sep 26 '25 14:09

I fully agree. Designing an extensible architecture makes sense, since it will let us easily swap between experiment tracking solutions (e.g., MLflow, W&B, or even a custom option). My only question is: in the short to medium term, what approach should we take to deliver the most value to users?

andreyvelich, Sep 26 '25 14:09

> My only question is: in the short to medium term, what approach should we take to deliver the most value to users?

In my very humble opinion: in the short term, an SPI that maps 1:1 to the MLflow API, with the MLflow integration as its implementation. I'm aware this is very limiting and naive, but at least it forces us to identify where the boundary for this integration lies. In turn, it should make it easier to direct contributors or GSoC students who want to integrate W&B (found the https://github.com/kubeflow/community/pull/892#discussion_r2263805436 ! 😄 ) or another tracking system next.
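
To make the SPI idea above concrete, here is a minimal Python sketch of where such a boundary could sit. Everything in it is hypothetical and for illustration only: `TrackingBackend`, `InMemoryBackend`, `get_backend`, and the method names are assumptions loosely modeled on an Experiment/Run shape, not an existing Kubeflow or MLflow API. A real MLflow-backed implementation would live behind the same interface.

```python
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Type


@dataclass
class Run:
    """One tracked run: parameters plus a (step, value) history per metric."""
    run_id: str
    experiment: str
    params: Dict[str, str] = field(default_factory=dict)
    metrics: Dict[str, List[Tuple[int, float]]] = field(default_factory=dict)


class TrackingBackend(ABC):
    """Hypothetical SPI boundary: each tracking system implements this."""

    @abstractmethod
    def create_run(self, experiment: str) -> str:
        """Start a run under the named experiment and return its id."""

    @abstractmethod
    def log_param(self, run_id: str, key: str, value: str) -> None:
        """Record a single parameter for the run."""

    @abstractmethod
    def log_metric(self, run_id: str, key: str, value: float, step: int = 0) -> None:
        """Append one metric observation for the run."""


class InMemoryBackend(TrackingBackend):
    """Stand-in implementation; an MLflow or W&B adapter would replace it."""

    def __init__(self) -> None:
        self.runs: Dict[str, Run] = {}

    def create_run(self, experiment: str) -> str:
        run_id = uuid.uuid4().hex
        self.runs[run_id] = Run(run_id=run_id, experiment=experiment)
        return run_id

    def log_param(self, run_id: str, key: str, value: str) -> None:
        self.runs[run_id].params[key] = value

    def log_metric(self, run_id: str, key: str, value: float, step: int = 0) -> None:
        self.runs[run_id].metrics.setdefault(key, []).append((step, value))


# Hypothetical registry: new integrations are added here without touching callers.
_BACKENDS: Dict[str, Type[TrackingBackend]] = {"memory": InMemoryBackend}


def get_backend(name: str) -> TrackingBackend:
    """Look up a backend by name and return a fresh instance."""
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown tracking backend: {name!r}") from None
```

Calling code would only depend on `TrackingBackend`, for example `backend = get_backend("memory")` followed by `backend.create_run("demo")`, so swapping the tracking system becomes a configuration change rather than a code change.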

tarilabs, Sep 26 '25 14:09

Experiment tracking depends heavily on the Registry and the UI for visualizations and for tracking models, versions, and metrics. What are your thoughts on that when we talk about this SPI-based integration?

Are we saying the SPI enables users to capture data and then use the native tools they integrated with, for example using the MLflow UI separately? My next question is how we foresee bringing the champion model back into the Kubeflow Model Registry for deployment or management. Or do we need to? For me, this also defines the scope of Model Registry activities going forward. Thoughts?

rareddy, Sep 26 '25 16:09

I've reached out to the MLflow community to see whether they are willing to let me contribute a multi-tenancy feature, which would allow us to have a single MLflow instance for a Kubeflow installation. Then the Kubeflow community (could be Pipeline WG) could maintain an MLflow plugin to handle Kubernetes RBAC requirements: https://github.com/mlflow/mlflow/issues/5844#issuecomment-3363085412

mprahl, Oct 09 '25 19:10

> I've reached out to the MLflow community to see their willingness for me to contribute a multi-tenancy feature which would allow us to have a single MLflow instance for a Kubeflow installation. Then the Kubeflow community (could be Pipeline WG) could maintain an MLflow plugin to handle Kubernetes RBAC requirements: mlflow/mlflow#5844 (comment)

Thank you very much. Ping me on Slack if you need help.

juliusvonkohout, Oct 28 '25 14:10