training-operator Track TrainJob progress and expose training metrics

What you would like to be added?

This feature proposes 1) defining a standard "contract" for training runtimes to expose / push the fine-grained state of their training loops, and 2) implementing a mechanism that updates the TrainJob resource status to reflect the training state / progress in real time.

The status should, among other information:

- Include a progress percentage (e.g. current steps / total steps); an ETA would also be very useful
- Include the training metrics that are relevant for HPO with Katib
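
As a rough illustration of the percentage and ETA mentioned above, here is a minimal sketch. The `Progress` class and its field names are hypothetical and not part of any existing TrainJob API. It assumes a roughly constant time per step, which only approximates real training runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Progress:
    """Hypothetical progress record; field names are illustrative only."""

    current_step: int
    total_steps: int
    elapsed_seconds: float

    @property
    def percent(self) -> float:
        # Guard against an unknown or zero total step count.
        if self.total_steps <= 0:
            return 0.0
        return min(100.0, 100.0 * self.current_step / self.total_steps)

    @property
    def eta_seconds(self) -> Optional[float]:
        # No estimate is possible before the first step completes.
        if self.current_step <= 0:
            return None
        # Assumption: every remaining step takes the average time so far.
        seconds_per_step = self.elapsed_seconds / self.current_step
        remaining_steps = max(0, self.total_steps - self.current_step)
        return seconds_per_step * remaining_steps
```

A smoothed estimate (for example over the last N steps) would likely behave better when step time varies, e.g. around evaluation or checkpointing.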

The implementation outline could be:

- Define a schema for the training runtime to expose the training loop metrics
- Instrument training loops to periodically write their progression / status in the above format, e.g. to their rank 0 node standard output
  - For custom trainers, provide examples showing how to instrument the training loop, e.g. for HuggingFace Transformers Trainer callbacks
  - For built-in trainers, we may want to seamlessly instrument the runtime
- Augment the TrainJob controller to watch rank 0 nodes of running TrainJobs to read the metrics and update the corresponding TrainJob statuses

One benefit of this approach is that it adds no extra RBAC / security requirements for the TrainJob Pods, which would still be able to run using the default service account.
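
To make the outline more concrete, here is a minimal sketch of what writing and reading such a record could look like. Everything in it is an assumption for discussion: the `TRAINJOB_PROGRESS ` line prefix, the JSON field names, and both helper functions are hypothetical, and the actual schema is still to be defined. The parsing side is shown in Python only for illustration; the controller itself would implement its own reader.

```python
import json
import sys
from typing import Optional, TextIO

# Hypothetical marker that lets a reader pick progress lines out of normal logs.
PREFIX = "TRAINJOB_PROGRESS "


def emit_progress(step: int, total_steps: int, metrics: dict,
                  rank: int = 0, stream: Optional[TextIO] = None) -> None:
    """Write one progress record as a single prefixed JSON line (rank 0 only)."""
    if rank != 0:
        return
    out = stream if stream is not None else sys.stdout
    record = {"step": step, "totalSteps": total_steps, "metrics": metrics}
    out.write(PREFIX + json.dumps(record, sort_keys=True) + "\n")
    out.flush()


def parse_progress(line: str) -> Optional[dict]:
    """Return the decoded record, or None if the line is not a progress line."""
    if not line.startswith(PREFIX):
        return None
    try:
        return json.loads(line[len(PREFIX):])
    except json.JSONDecodeError:
        # Malformed records are ignored rather than failing the reader.
        return None
```

A custom trainer could call something like `emit_progress(step, total_steps, {"loss": loss})` from its logging hook, while regular log lines are skipped by `parse_progress`.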

Why is this needed?

Model training is an iterative process whose progression over time is fairly predictable, which makes tracking the progress of train jobs possible, desirable and useful.

While a training job's progress is usually accessible by reading the logs of the job's rank 0 node, this might not be the best user experience for AI practitioners, nor does it provide the most robust mechanism for clients to access / parse this information.

Exposing the training metrics in real time in the TrainJob API status will also unlock integration with other components like Katib, GUIs and possibly experiment tracking solutions.

Love this feature?

Give it a 👍 We prioritize the features with the most 👍

Aug 07 '25 08:08 astefanutti

I would vote for status if possible.

APIService would require authentication and probably a certificate for more secure platforms.

Aug 07 '25 20:08 kannon92

I'd also be inclined to favor the status approach.

FWIW, Kueue relies on APIService for the visibility API, and the cert-controller that's used for webhooks can also be used to manage the serving certificate.

Aug 11 '25 13:08 astefanutti

/assign

Sep 08 '25 09:09 abhijeet-dhumal

/area tracking
/remove-label lifecycle/needs-triage

Oct 08 '25 16:10 andreyvelich

I've updated the description following the discussion about Katib and TrainJob integration we had during this week's Kubeflow SDK & ML Experience community call, so it now includes tracking training metrics in general and not only training progress (time-wise).

Oct 10 '25 10:10 astefanutti

> I've updated the description following the discussion about Katib and TrainJob integration we had during this week's Kubeflow SDK & ML Experience community call, so it now includes tracking training metrics in general and not only training progress (time-wise).

Awesome, thanks a mil @astefanutti! I will follow it accordingly.

Oct 10 '25 12:10 abhijeet-dhumal

Hi folks, I've opened a PR with a KEP for this feature: https://github.com/kubeflow/trainer/pull/2905.

We'd love to start the discussion on how best to implement this feature.

Oct 28 '25 14:10 robert-bell

/milestone v2.2

Jan 21 '26 23:01 andreyvelich