[Release] Training Operator 1.8 Roadmap
This is the tracking issue for the Training Operator 1.8 release. The feature freeze date for the Kubeflow 1.9 release is April 15th.
We are targeting the following features for Training Operator 1.8:
SDK
- [x] Train/Fine-Tune API for LLMs
- [x] Unit tests for SDKs: https://github.com/kubeflow/training-operator/pull/1938
- [x] Consolidate SDK APIs: https://github.com/kubeflow/training-operator/issues/1877
- [x] Fetch Job and Pod events: https://github.com/kubeflow/training-operator/issues/1863
- [x] Set compute resources: https://github.com/kubeflow/training-operator/pull/1990
Backend
- [x] https://github.com/kubeflow/training-operator/issues/1999
- [x] Deprecation notice for MXJob: https://github.com/kubeflow/training-operator/issues/1996
- [x] https://github.com/kubeflow/training-operator/issues/1993
Misc
- [x] https://github.com/kubeflow/training-operator/issues/1998
- [ ] Update contributor guide.
- [x] Example for torchrun and PyTorchJob: https://github.com/kubeflow/training-operator/pull/1965
@deepanker13 @droctothorpe @tenzen-y @kubeflow/wg-training-leads @kuizhiqing @terrytangyuan @lowang-bh Please let me know which items we want to add for Training Operator 1.8.
cc @kubeflow/release-team
@johnugeorge @deepanker13 Do we need to create a tracking issue with the remaining items for the Train/Fine-tune API for LLMs?
I'd like to get https://github.com/kubeflow/training-operator/pull/1953 merged as well. I think the risk is pretty low.
@andreyvelich thanks for putting this together. On the "Misc: Improve docs for the training operator", if you can start a separate issue highlighting known issues, doc areas to be improved, or particular topics you want to address, we can start coordinating with the release team doc leads as well to get some help.
I would suggest having a separate issue for autogen APIs, in case you want to address that as well.
@terrytangyuan Sure, can we discuss the MXJob deprecation plan at the next AutoML and Training WG meeting? I think it would be better to remove support for MXJob over 2 releases. For example, in the 1.8 release we inform users that MXJob will be removed in the next version, and when we release 1.9 we remove MXJob. That should give users sufficient time to migrate, even though MXNet has already been archived. WDYT @kubeflow/wg-training-leads @tenzen-y?
> if you can start a separate issue highlighting known issues

@StefanoFioravanzo Sure, I will create an issue based on the tasks that we discussed on the last call. Also, I will create an issue for SDK doc autogen.
First of all, as I mentioned here: https://github.com/kubeflow/katib/issues/2255#issuecomment-1910584792, I would suggest supporting kubernetes v1.27-v1.29.
Also, moving #1906 forward would be good. It probably isn't possible to complete all the tasks, but I think we will be able to get some results.
> I think it would be better to remove support for MXJob over 2 releases. For example, in the 1.8 release we inform users that MXJob will be removed in the next version, and when we release 1.9 we remove MXJob. That should give users sufficient time to migrate, even though MXNet has already been archived. WDYT @kubeflow/wg-training-leads @tenzen-y?

SGTM. We can say that we won't do any maintenance for MXJob for one release, which means it is deprecated. Creating a dedicated issue would be better.
@andreyvelich Sounds good
> First of all, as I mentioned here: https://github.com/kubeflow/katib/issues/2255#issuecomment-1910584792, I would suggest supporting kubernetes v1.27-v1.29.

That's a good point about the Kubernetes version, @tenzen-y! I agree that 1.27 - 1.29 should be our target. @kubeflow/release-team What do you think about a target of supporting Kubernetes 1.27 - 1.29 for the Kubeflow 1.9 release?
Ah, I found the features that we dropped from the previous release due to the release deadline.
Can we add the following to improve UX:
- https://github.com/kubeflow/training-operator/issues/1993
- https://github.com/kubeflow/training-operator/issues/1708
I just had a discussion with @kubeflow/release-managers on Kubernetes versions. We are going to target Kubernetes 1.27 - 1.29 for the next release of Training Operator.
That's a nice notification! Thank you!
> @johnugeorge @deepanker13 Do we need to create a tracking issue with the remaining items for the Train/Fine-tune API for LLMs?
Okay, I will create one.
Hello @kubeflow/wg-training-leads, this is a kind reminder that Monday, March 4th will be our Kubeflow 1.9 release development checkpoint. We will be halfway through our dev cycle, and we expect most of the work to be well underway (reminder: code freeze is scheduled for Apr 15th).
Can you please acknowledge your status with respect to your roadmap, comment on the progress made so far, and provide an assessment of the work that remains?
(Understandably) not everything may be completed in time. Please proactively let the release team know if there are delays, blockers, or uncertain situations, so that we can align expectations and try to help you out, if possible.
Hi! When is v1.8 planned for release? Some managed k8s versions, e.g. on EKS, reach end of support very soon (July 24, 2024): https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar So this release is very important for k8s users who plan to migrate. Is there a tentative timeline? Please advise. @StefanoFioravanzo @andreyvelich
Hi @satishpasumarthi, we are planning to make the first RC.0 for Training Operator v1.8 this week. We will support Kubernetes v1.27-1.29 in that release.
Thanks for the reply, @andreyvelich. I only see PRs for supporting v1.28 and v1.29: https://github.com/kubeflow/training-operator/pull/2039 and https://github.com/kubeflow/training-operator/pull/2038. My understanding was that v1.27 is already supported in v1.7. Please correct me if I am mistaken.
@satishpasumarthi You're correct. In v1.7, the training-operator supports v1.25-v1.27. In v1.8, the training-operator will support v1.27-v1.29.
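For anyone who wants to check a cluster against these ranges before upgrading, here is a minimal sketch. The ranges come from the comment above; the `SUPPORTED_K8S` table and the `parse_minor` and `is_supported` helpers are hypothetical names for illustration and are not part of the training-operator or its SDK.

```python
import re

# Supported Kubernetes minor-version ranges per Training Operator release,
# as stated in this thread. This table is illustrative, not an official API.
SUPPORTED_K8S = {
    "v1.7": ((1, 25), (1, 27)),
    "v1.8": ((1, 27), (1, 29)),
}


def parse_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.28.3' or '1.28'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(operator_release: str, k8s_version: str) -> bool:
    """Return True if k8s_version falls inside the release's inclusive range."""
    low, high = SUPPORTED_K8S[operator_release]
    return low <= parse_minor(k8s_version) <= high
```

With this table, Kubernetes v1.27 is the only minor version covered by both releases, so it is the natural stepping stone when moving from v1.7 to v1.8.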
Is there anything missing to cut the release? We want to start the manifests sync for training-operator for Kubeflow 1.9.0-rc0.
Not yet. Johnu will prepare the release today.
Any updates on when we might see a new release?
You can find the new release here: https://github.com/kubeflow/training-operator/releases/tag/v1.8.0-rc.0
Training Operator 1.8 has been released 🎉 https://github.com/kubeflow/training-operator/releases/tag/v1.8.0
Thanks everyone for your contributions!