Track TrainJob progress and expose training metrics
What you would like to be added?
This feature proposes 1) the definition of a standard "contract" for training runtimes to expose / push the fine-grained state of the training loops and 2) the implementation of a mechanism that updates the TrainJob resource statuses to reflect the training state / progress in real-time.
The status should, among other information:
- Include a completion percentage (e.g. current steps / total steps); an ETA would also be very useful
- Include the training metrics that are relevant for HPO with Katib
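As a rough illustration of the two items above, here is a minimal sketch of how a completion percentage and an ETA could be derived from step counts. The `ProgressStatus` type, its field names and the `compute_progress` helper are all hypothetical and are not part of the TrainJob API; the ETA assumes the remaining steps take the same average time as the completed ones.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProgressStatus:
    """Hypothetical shape of the progress data; NOT the actual TrainJob status."""
    current_step: int
    total_steps: int
    percent_complete: float
    eta_seconds: Optional[float]  # None until at least one step has completed
    metrics: Dict[str, float] = field(default_factory=dict)


def compute_progress(current_step, total_steps, elapsed_seconds, metrics=None):
    """Derive percentage and ETA from step counts and elapsed wall-clock time."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the percentage always stays within [0, 100].
    step = min(max(current_step, 0), total_steps)
    percent = 100.0 * step / total_steps
    eta = None
    if step > 0:
        # Linear extrapolation: assumes a constant average time per step.
        eta = elapsed_seconds * (total_steps - step) / step
    return ProgressStatus(step, total_steps, percent, eta, dict(metrics or {}))


if __name__ == "__main__":
    print(compute_progress(250, 1000, 50.0, {"loss": 0.42}))
```

Linear extrapolation is only one possible estimator; warm-up phases or periodic evaluation would make a moving average over recent steps more accurate.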
The implementation outline could be:
- Define a schema for the training runtime to expose the training loop metrics
- Instrument training loops to periodically write their progression / status in the above format, e.g. to their rank 0 node standard output
- For custom trainers, provide examples showing how to instrument the training loop, e.g. for HuggingFace Transformers Trainer callbacks
- For built-in trainers, we may want to seamlessly instrument the runtime
- Augment the TrainJob controller to watch rank 0 nodes of running TrainJobs to read the metrics and update the corresponding TrainJob statuses
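The outline above can be sketched end to end as follows. This is only an illustration under stated assumptions: the `TRAINJOB_PROGRESS ` line prefix, the JSON field names and all function names are hypothetical, since the actual schema is still to be defined. The real controller would not be written in Python; the parsing half only shows the kind of logic it would need when reading rank 0 logs.

```python
import json
import sys
from typing import Iterable, Optional

# Hypothetical marker used to tell progress lines apart from ordinary log output.
PREFIX = "TRAINJOB_PROGRESS "


def format_progress_line(step: int, total_steps: int, metrics: dict) -> str:
    """Serialize one progress record as a single prefixed JSON line."""
    payload = {"step": step, "total_steps": total_steps, "metrics": metrics}
    return PREFIX + json.dumps(payload, sort_keys=True)


def report_progress(step: int, total_steps: int, metrics: dict, stream=None) -> None:
    """Training side: write a progress line to standard output (rank 0 only)."""
    stream = stream if stream is not None else sys.stdout
    stream.write(format_progress_line(step, total_steps, metrics) + "\n")
    stream.flush()  # flush so that log readers see the update promptly


def parse_progress_line(line: str) -> Optional[dict]:
    """Reader side: return the record, or None for non-progress or malformed lines."""
    if not line.startswith(PREFIX):
        return None
    try:
        payload = json.loads(line[len(PREFIX):])
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


def latest_progress(log_lines: Iterable[str]) -> Optional[dict]:
    """Return the most recent valid progress record found in the log lines."""
    latest = None
    for line in log_lines:
        parsed = parse_progress_line(line.strip())
        if parsed is not None:
            latest = parsed
    return latest


if __name__ == "__main__":
    report_progress(10, 100, {"loss": 1.5})
```

For a custom trainer based on HuggingFace Transformers, `report_progress` could be called from a `TrainerCallback` hook such as `on_log`, guarded so that only the rank 0 process writes.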
One benefit of this approach is that it would not add any extra RBAC / security requirements for the TrainJob Pods, which would still be able to run using the default service account.
Why is this needed?
Model training is an iterative process whose progression in time is fairly predictable, which makes tracking the progression of train jobs possible, desirable and useful.
While a training job's progression is usually accessible by reading the logs of the job's rank 0 node, this might not be the best user experience for AI practitioners, nor does it provide the most robust mechanism for clients to access / parse this information.
Exposing the training metrics in real-time to the TrainJob API status will also unlock integration with other components like Katib, GUIs and possibly experimentation tracking solutions.
Love this feature?
Give it a 👍 We prioritize the features with most 👍
I would vote for status if possible.
APIService would require authentication and probably a certificate for more secure platforms.
I'd also be inclined to favor the status approach.
FWIW Kueue relies on APIService for the visibility API, and cert-controller that's used for webhooks can also be used to manage the serving certificate.
/assign
/area tracking
/remove-label lifecycle/needs-triage
I've updated the description following the discussion about Katib and TrainJob integration we had during this week's Kubeflow SDK & ML Experience community call, so it now includes tracking training metrics in general and not only training progress (time-wise).
Awesome, thanks a mil @astefanutti! I will follow it accordingly.
Hi folks, I've made a PR with a KEP for this feature: https://github.com/kubeflow/trainer/pull/2905.
We'd love to start the discussion on how best to implement this feature.
/milestone v2.2