KEP-897: Propose centralized experiment tracking in Kubeflow
GitHub issue: #897
This proposal aims to resolve the current fragmented and limited experiment tracking experience by expanding the Kubeflow Model Registry into a unified, centralized metadata store. Currently, experiment tracking is scattered across components like Kubeflow Pipelines (which requires pipeline execution for tracking) and Katib (limited to hyperparameter tuning). This leads to challenges such as limited flexibility for direct logging from Python scripts or Jupyter notebooks, a fragmented user experience across multiple interfaces, and maintenance difficulties due to reliance on the inactive MLMD project.
Thank you for driving this @mprahl! Could you please create a tracking issue under kubeflow/community so you can get the KEP number?
It would also be good to mention the history, as I noted here: https://github.com/kubeflow/community/issues/783
As previously discussed in
- https://github.com/kubeflow/community/issues/783
- https://github.com/kubeflow/community/issues/238
I'm closing this KEP because my team no longer has capacity to take this on. If others want to pursue this, feel free to fork the KEP and I'll be happy to review and advise. :smile:
@mprahl may we keep it open for now? Just to have it tracked.
The stalebot will close it anyway if there is no activity on this topic.
I agree with @juliusvonkohout!
Maybe we should put out a call for contributors to help us add Experiment Tracking support via MLflow for Kubeflow sub-projects. This feels like a really important capability that many of our users are asking for, and moving it forward would have a big impact on usability and Kubeflow adoption.
cc @kubeflow/wg-training-leads @kubeflow/wg-pipeline-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-outreach-committee @jbottum
Rather than tying it strictly to MLflow as the implementation choice, I believe it would be very helpful to add an SPI (strongly inspired by MLflow's Experiment/Run concepts to begin with) so that other integrations in this area can be plugged in later.
Not to dispute MLflow's popularity, but other community discussions show that alternatives also have their market share, so an SPI would also prepare the ground for additional contributors, which ties in with what Andrey just said.
What would be the @kubeflow/kubeflow-steering-committee pov on this?
I fully agree - designing an extensible architecture makes sense, since it will let us easily swap between experiment tracking solutions (e.g., MLflow, W&B, or even a custom option). My only question is: in the short to medium term, what approach should we take to deliver the most value to users?
> My only question is: in the short to medium term, what approach should we take to deliver the most value to users?
Very much IMHO: in the short term, an SPI that maps 1:1 to the MLflow API (with the MLflow integration as its implementation). I'm aware this is very limiting and naive, but at least it forces us to identify where the boundary for this integration lies. In turn, it should make it easier to direct contributors/GSoC students who want to integrate W&B (found the https://github.com/kubeflow/community/pull/892#discussion_r2263805436 ! 😄 ) or another tracking system next.
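To make the SPI idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ExperimentTracker`, `Run`, and `InMemoryTracker` are hypothetical names, not an existing Kubeflow or MLflow API. The method names only loosely echo MLflow's experiment/run concepts, and a real MLflow-backed implementation would have to be designed against the actual MLflow client.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
import uuid


@dataclass
class Run:
    """Hypothetical record of one tracked run inside an experiment."""
    run_id: str
    experiment: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class ExperimentTracker(ABC):
    """Hypothetical SPI: the boundary that each tracking backend would implement."""

    @abstractmethod
    def start_run(self, experiment: str) -> Run:
        """Create a new run under the named experiment."""

    @abstractmethod
    def log_param(self, run: Run, key: str, value: str) -> None:
        """Record a single parameter for the run."""

    @abstractmethod
    def log_metric(self, run: Run, key: str, value: float, step: int = 0) -> None:
        """Record a metric value at a given step for the run."""


class InMemoryTracker(ExperimentTracker):
    """Toy backend that keeps everything in a dict; stands in for a real integration."""

    def __init__(self) -> None:
        self.runs: dict = {}

    def start_run(self, experiment: str) -> Run:
        run = Run(run_id=uuid.uuid4().hex, experiment=experiment)
        self.runs[run.run_id] = run
        return run

    def log_param(self, run: Run, key: str, value: str) -> None:
        run.params[key] = value

    def log_metric(self, run: Run, key: str, value: float, step: int = 0) -> None:
        # Keep the full history of (step, value) pairs per metric name.
        run.metrics.setdefault(key, []).append((step, value))
```

User code would depend only on `ExperimentTracker`, so swapping the in-memory backend for an MLflow- or W&B-backed one would not change the calling script or notebook.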
Experiment tracking depends heavily on the Registry and the UI for visualizations and for tracking models, versions, and metrics. What are the thoughts on that when we talk about this SPI-based integration?
Are we saying that the SPI lets users capture data and then use the native tools they integrated with, for example the MLflow UI, separately? My next question is how we foresee bringing the champion model back into the Kubeflow Model Registry for deployment or management, or whether we need to at all. For me, this also defines the scope of Model Registry activities going forward. Thoughts?
I've reached out to the MLflow community to see whether they would be willing to let me contribute a multi-tenancy feature, which would allow us to have a single MLflow instance for a Kubeflow installation. Then the Kubeflow community (could be Pipeline WG) could maintain an MLflow plugin to handle Kubernetes RBAC requirements: https://github.com/mlflow/mlflow/issues/5844#issuecomment-3363085412
Thank you very much. Ping me on Slack if you need help.