
feat(docs): KEP-2779: Track TrainJob progress and expose training metrics

Open robert-bell opened this issue 4 months ago • 10 comments

What this PR does / why we need it:

This is a KEP for adding real-time progress and exposing training metrics on TrainJob.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Part of #2779

robert-bell avatar Oct 28 '25 14:10 robert-bell

Pull Request Test Coverage Report for Build 21214073338

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 19 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 51.385%

Files with Coverage Reduction         New Missed Lines   %
pkg/runtime/core/trainingruntime.go   19                 69.94%
Total: 19

Totals Coverage Status
Change from base Build 21036734925: 0.1%
Covered Lines: 1243
Relevant Lines: 2419

💛 - Coveralls

coveralls avatar Oct 31 '25 19:10 coveralls

> Another proposed solution: enable the KubeFlowCallBack to send status to the TrainJob directly via the Kubernetes API. When the operator starts a new training job, it creates a ServiceAccount, injects it into the master/primary pod, and lets the callback fetch the status cleanly and patch or update the TrainJob directly with the required status updates. This approach is much cleaner:
>
>   • No log parsing
>   • No goroutines
>   • No load on the kube-apiserver from fetching and streaming logs
>   • Only the master/primary pod can update the status; worker pods check their permissions and skip the update if they don't have the right ones
>   • The train job itself performs its own status updates
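A minimal sketch of this direct-update approach, assuming the `kubernetes` Python client is available in the primary pod. The API group `trainer.kubeflow.org/v1alpha1`, the `trainjobs` plural, and the status field names are all assumptions for illustration, not the actual Kubeflow Trainer API:

```python
# Illustrative only: a training callback patching TrainJob status from
# inside the primary pod, using the ServiceAccount the operator injected.

def build_status_patch(step: int, loss: float) -> dict:
    # Merge-patch body applied to the TrainJob's /status subresource.
    # Field names here are hypothetical.
    return {
        "status": {
            "progress": {
                "currentStep": step,
                "metrics": [{"name": "train_loss", "value": str(loss)}],
            }
        }
    }

def report_progress(namespace: str, trainjob_name: str, step: int, loss: float):
    # Requires the `kubernetes` package and an in-cluster ServiceAccount token.
    from kubernetes import client, config

    config.load_incluster_config()
    api = client.CustomObjectsApi()
    api.patch_namespaced_custom_object_status(
        group="trainer.kubeflow.org",  # assumed group/version/plural
        version="v1alpha1",
        namespace=namespace,
        plural="trainjobs",
        name=trainjob_name,
        body=build_status_patch(step, loss),
    )
```

The security and RBAC concerns discussed below apply directly to a sketch like this, since the patching credentials live next to arbitrary user code.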

To add to this analysis: given that the TrainJob "primary" Pods run arbitrary user code and extra packages, granting these Pods privileged access to the API server would warrant hardened security sandboxing.

astefanutti avatar Nov 03 '25 07:11 astefanutti

> Another proposed solution: enable the KubeFlowCallBack to send status to the TrainJob directly via the Kubernetes API. When the operator starts a new training job, it creates a ServiceAccount, injects it into the master/primary pod, and lets the callback fetch the status cleanly and patch or update the TrainJob directly with the required status updates. This approach is much cleaner.

> To add to this analysis: given that the TrainJob "primary" Pods run arbitrary user code and extra packages, granting these Pods privileged access to the API server would warrant hardened security sandboxing.

Granting permissions so the train job itself can directly update the training status comes with a few more downsides:

  • The controller would also need to create a RoleBinding for the ServiceAccount, which may need some care to avoid privilege escalation (I'm not too sure of the details).
  • The controller may need to provide additional ways to configure the ServiceAccount so users can, for example, grant it extra permissions (if the TrainJob needs to call the Kubernetes API) or configure out-of-cluster permissions such as AWS IRSA.
  • If users are using a custom ServiceAccount for their train job, they would need to manually add a RoleBinding allowing it to update the TrainJob.
  • The training runtime pods would need a Kubernetes client available. While this could be done (e.g. installed at the start of the job, or via a sidecar containing the client to export the progress updates), it adds some complexity.
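On the privilege-escalation point, one mitigation would be a per-TrainJob Role scoped with `resourceNames`, so the injected ServiceAccount can only touch the status of its own TrainJob. The sketch below is purely illustrative (not part of the proposal), and the API group and subresource names are assumptions:

```python
# Illustrative only: the Role/RoleBinding a controller might generate per
# TrainJob, expressed as the dicts it would submit to the API server.
# "trainer.kubeflow.org" and "trainjobs/status" are assumed names.

def build_scoped_rbac(namespace: str, trainjob_name: str, sa_name: str):
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": f"{trainjob_name}-status-writer", "namespace": namespace},
        "rules": [{
            "apiGroups": ["trainer.kubeflow.org"],
            "resources": ["trainjobs/status"],
            # resourceNames restricts the grant to this single TrainJob,
            # limiting the blast radius if the pod is compromised.
            "resourceNames": [trainjob_name],
            "verbs": ["get", "patch", "update"],
        }],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{trainjob_name}-status-writer", "namespace": namespace},
        "subjects": [{"kind": "ServiceAccount", "name": sa_name, "namespace": namespace}],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": f"{trainjob_name}-status-writer",
        },
    }
    return role, binding
```

Even with this scoping, the controller would be creating and garbage-collecting RBAC objects per job, which is part of the complexity being weighed here.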

robert-bell avatar Nov 06 '25 18:11 robert-bell

Hi Folks, thanks again for this effort! This feature should significantly improve the experience of running AI workloads on Kubernetes, and has a lot of future potential for checkpointing, scheduling, etc. 🎉

@astefanutti @robert-bell Do we need to update the KEP based on our conversation with @astefanutti and @tenzen-y at KubeCon, so we can review and move this forward?

cc @romanbaron @EkinKarabulut @Ronkahn21 as discussed, this should address the problems with detecting the running status of a TrainJob mentioned by Ron in this talk: https://sched.co/28D8Z

andreyvelich avatar Nov 25 '25 00:11 andreyvelich

Hey @andreyvelich! Thanks for chasing, and apols for the silence.

> @astefanutti @robert-bell Do we need to update the KEP based on our conversation with @astefanutti and @tenzen-y at KubeCon, so we can review and move this forward?

I'm working on an update to bring in things from that discussion. I'm aiming to find some time this week.

robert-bell avatar Nov 25 '25 08:11 robert-bell

Hi folk, @astefanutti @andreyvelich @tenzen-y,

I've updated the proposal with a new design based on the discussions from KubeCon, and incorporated some of the ideas from the first draft. The changes are split into separate commits to make it easier to see what's changed.

The main change is that I've replaced the log-watching mechanism with HTTP calls. I've actually written two variants of the proposal - a push-based one (the runtime sends to the control plane) and a pull-based one (the control plane scrapes from the runtime) - because they're similar but have different trade-offs. We only need one of the approaches, though, and I thought this might be the easiest way to weigh the merits of each and help us reach consensus on how best to move forward.
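For concreteness, the push variant could look something like the sketch below: the runtime periodically POSTs a small JSON document to an endpoint exposed by the control plane. The endpoint URL and payload fields here are placeholders for illustration, not the shapes defined in the KEP:

```python
import json
import urllib.request

def build_progress_payload(step: int, total_steps: int, metrics: dict) -> dict:
    # Hypothetical payload shape; the real field names live in the KEP.
    return {
        "currentStep": step,
        "totalSteps": total_steps,
        "metrics": [{"name": k, "value": str(v)} for k, v in metrics.items()],
    }

def push_progress(endpoint: str, payload: dict, timeout: float = 2.0) -> int:
    # Fire-and-forget push to the control plane. In a real callback this
    # would be wrapped in try/except so a reporting failure never crashes
    # the training loop.
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

The pull variant inverts this: the runtime would expose an HTTP endpoint serving the same kind of document, and the control plane would scrape it, trading a simpler callback for extra scrape machinery on the operator side.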

The other changes I've made are:

  • replaced the trainMetrics and evalMetrics fields with a more general MetricsGroup struct.
  • clarified the goals to make it clearer that we're not trying to replace tools like MLFlow.
  • updated the other considered alternatives section with a few more details.

Please let me know if any extra info is needed to help this move forward.

robert-bell avatar Dec 01 '25 15:12 robert-bell

/ok-to-test

andreyvelich avatar Dec 23 '25 01:12 andreyvelich

Hey @andreyvelich @tenzen-y, I've updated the KEP based on our conversation from last Wednesday. Please could you take another look?

The changes are in the last two commits but in short, I've:

  • made it clear that we're proposing the push-based approach, and moved the pull-based/scraping approach to "Other considered alternatives".
  • removed the CurrentStep, TotalSteps, CurrentEpoch and TotalEpochs fields. We can put them in the metrics list instead.
  • set the metric values to strings. I haven't done anything about supporting string/int/bool as per Device Attributes, which was mentioned on the call, because strings should be enough for a first version - maybe we could consider that in a later iteration?
  • removed the bits about framework-specific custom trainers. If there's any interest in this, maybe we can discuss it separately, but I'm happy to park it for now.

I think that covers everything we discussed on the call, but let me know if I missed anything or you have any other thoughts.

robert-bell avatar Jan 12 '26 15:01 robert-bell

Hey @andreyvelich @tenzen-y @astefanutti I've done another update.

Main changes since last time:

  • updated the path, added versioning to the new endpoint, and added an example payload.
  • clarified what changes when the feature gate is enabled/disabled.

I've also raised a question here https://github.com/kubeflow/trainer/pull/2905#discussion_r2698388441 which could do with your input.

Please take a look, thanks.

robert-bell avatar Jan 16 '26 13:01 robert-bell

Awesome, thanks @andreyvelich, and thank you for your detailed feedback and help moving this forward.

robert-bell avatar Jan 19 '26 15:01 robert-bell

/lgtm Thanks @robert-bell

akshaychitneni avatar Jan 19 '26 18:01 akshaychitneni

/lgtm

Awesome work @robert-bell @abhijeet-dhumal thanks!

astefanutti avatar Jan 20 '26 18:01 astefanutti

/retest

andreyvelich avatar Jan 20 '26 19:01 andreyvelich

@andreyvelich @astefanutti @akshaychitneni @tenzen-y I pushed a minor update in response to one of the latest comments, forgetting that it would reset the lgtm labels. Apols. Could you please re-review?

Is there anything else that we need for this? The next steps for implementing it are looking good and clear from my end, and we should be able to start on the implementation mid next week assuming everyone's happy with the current plan.

robert-bell avatar Jan 21 '26 14:01 robert-bell

/lgtm
/approve
/hold in case @tenzen-y has any additional comments.

andreyvelich avatar Jan 21 '26 16:01 andreyvelich

@robert-bell Thank you for moving that forward! LGTM 👍 /lgtm /approve

tenzen-y avatar Jan 22 '26 04:01 tenzen-y

/hold cancel

tenzen-y avatar Jan 22 '26 04:01 tenzen-y

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [andreyvelich,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Jan 22 '26 04:01 google-oss-prow[bot]