training-operator [Release] Training Operator 1.8 Roadmap

This is the tracking issue for the Training Operator 1.8 release. The feature freeze date for the next Kubeflow release, 1.9, is April 15th.

We are targeting the following features for Training Operator 1.8:

SDK

[x] Train/Fine-Tune API for LLMs
[x] Unit tests for SDKs: https://github.com/kubeflow/training-operator/pull/1938
[x] Consolidate SDK APIs: https://github.com/kubeflow/training-operator/issues/1877
[x] Fetch Job and Pod events: https://github.com/kubeflow/training-operator/issues/1863
[x] Set compute resources: https://github.com/kubeflow/training-operator/pull/1990

Backend

[x] https://github.com/kubeflow/training-operator/issues/1999
[x] Deprecation notice for MXJob: https://github.com/kubeflow/training-operator/issues/1996
[x] https://github.com/kubeflow/training-operator/issues/1993

Misc

[x] https://github.com/kubeflow/training-operator/issues/1998
[ ] Update contributor guide.
[x] Example for torchrun and PyTorchJob: https://github.com/kubeflow/training-operator/pull/1965

@deepanker13 @droctothorpe @tenzen-y @kubeflow/wg-training-leads @kuizhiqing @terrytangyuan @lowang-bh Please let me know which items we want to add for Training Operator 1.8.

cc @kubeflow/release-team

Jan 24 '24 21:01 andreyvelich

@johnugeorge @deepanker13 Do we need to create a tracking issue with the remaining items for the Train/Fine-Tune API for LLMs?

Jan 24 '24 21:01 andreyvelich

I'd like to get https://github.com/kubeflow/training-operator/pull/1953 merged as well. I think the risk is pretty low.

Jan 25 '24 00:01 terrytangyuan

@andreyvelich thanks for putting this together. On the "Misc: Improve docs for the training operator" item: if you can start a separate issue highlighting known issues, doc areas to be improved, or particular topics you want to address, we can start coordinating with the release team doc leads as well to get some help.

I would suggest having a separate issue for autogen APIs, in case you want to address that as well.

Jan 25 '24 08:01 StefanoFioravanzo

@terrytangyuan Sure, can we discuss the MXJob deprecation plan at the next AutoML and Training WG meeting? I think it would be better to remove support for MXJob over 2 releases. For example, in the 1.8 release we inform users that MXJob will be removed in the next version, and when we release 1.9 we remove MXJob. That should give users sufficient time to migrate, even though MXNet has already been archived. WDYT @kubeflow/wg-training-leads @tenzen-y?

if you can start a separate issue highlighting known issues

@StefanoFioravanzo Sure, I will create an issue based on the tasks that we discussed on the last call. Also, I will create an issue for SDK doc autogen.

Jan 25 '24 16:01 andreyvelich

First of all, as I mentioned here: https://github.com/kubeflow/katib/issues/2255#issuecomment-1910584792, I would suggest supporting Kubernetes v1.27-v1.29.

Also, moving #1906 forward would be better. It probably isn't possible to complete all the tasks, but I think we will be able to get some results.

Jan 25 '24 16:01 tenzen-y

I think it would be better to remove support for MXJob over 2 releases. For example, in the 1.8 release we inform users that MXJob will be removed in the next version, and when we release 1.9 we remove MXJob. That should give users sufficient time to migrate, even though MXNet has already been archived. WDYT @kubeflow/wg-training-leads @tenzen-y?

SGTM. We can say that we won't do any maintenance for MXJob for one release, which means it is deprecated. Creating a dedicated issue would be better.

Jan 25 '24 16:01 tenzen-y

@andreyvelich Sounds good

Jan 25 '24 17:01 terrytangyuan

First of all, as I mentioned here: https://github.com/kubeflow/katib/issues/2255#issuecomment-1910584792, I would suggest supporting Kubernetes v1.27-v1.29.

That's a good point about the Kubernetes version, @tenzen-y! I agree that 1.27 - 1.29 should be our target. @kubeflow/release-team What do you think about the goal of supporting Kubernetes 1.27 - 1.29 for the Kubeflow 1.9 release?

Jan 25 '24 20:01 andreyvelich

Ah, I found the features that we dropped from the previous release due to the release deadline.

Can we add the following to improve UX:

https://github.com/kubeflow/training-operator/issues/1993
https://github.com/kubeflow/training-operator/issues/1708

Jan 29 '24 20:01 tenzen-y

I just had a discussion with @kubeflow/release-managers on Kubernetes versions. We are going to target Kubernetes 1.27 - 1.29 for the next release of Training Operator.

Jan 29 '24 21:01 andreyvelich

I just had a discussion with @kubeflow/release-managers on Kubernetes versions. We are going to target Kubernetes 1.27 - 1.29 for the next release of Training Operator.

That's a nice update! Thank you!

Jan 29 '24 21:01 tenzen-y

@johnugeorge @deepanker13 Do we need to create a tracking issue with the remaining items for the Train/Fine-Tune API for LLMs?

Okay, I will create one.

Feb 05 '24 08:02 deepanker13

Hello @kubeflow/wg-training-leads, this is a kind reminder that Monday, March 4th will be our Kubeflow 1.9 release development checkpoint. We will be halfway through our dev cycle, and we expect most of the work to be well underway (reminder: code freeze is scheduled for Apr 15th).

Can you please acknowledge your status with respect to your roadmap, comment on the progress made so far, and provide an assessment of the work that remains?

Understandably, not everything may be completed in time. Please proactively let the release team know if there are delays, blockers, or uncertain situations, so that we can align expectations and try to help you out, if possible.

Feb 28 '24 14:02 StefanoFioravanzo

Hi! When is v1.8 planned for release? Some Kubernetes versions on managed services, e.g. EKS, reach end of support very soon (July 24, 2024): https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar So this release is very important for those who plan to migrate. Is there a tentative timeline? Please advise. @StefanoFioravanzo @andreyvelich

Apr 22 '24 17:04 satishpasumarthi

Hi @satishpasumarthi, we are planning to make the first release candidate (RC.0) for Training Operator v1.8 this week. We will support Kubernetes v1.27-1.29 in that release.

Apr 22 '24 22:04 andreyvelich

Hi @satishpasumarthi, we are planning to make the first release candidate (RC.0) for Training Operator v1.8 this week. We will support Kubernetes v1.27-1.29 in that release.

Thanks for the reply, @andreyvelich. I see only PRs for supporting v1.28 and v1.29: https://github.com/kubeflow/training-operator/pull/2039 and https://github.com/kubeflow/training-operator/pull/2038. My understanding was that v1.27 is already supported in v1.7. Please correct me if I am mistaken.

Apr 23 '24 03:04 satishpasumarthi

Hi @satishpasumarthi, we are planning to make the first release candidate (RC.0) for Training Operator v1.8 this week. We will support Kubernetes v1.27-1.29 in that release.

Thanks for the reply, @andreyvelich. I see only PRs for supporting v1.28 and v1.29: #2039 and #2038. My understanding was that v1.27 is already supported in v1.7. Please correct me if I am mistaken.

@satishpasumarthi You're correct. In v1.7, the training-operator supports Kubernetes v1.25-v1.27. In v1.8, the training-operator will support Kubernetes v1.27-v1.29.

Apr 23 '24 04:04 tenzen-y

Is there anything missing to cut the release? We want to start the manifests sync for training-operator for Kubeflow 1.9.0-rc0

Apr 26 '24 15:04 rimolive

Is there anything missing to cut the release? We want to start the manifests sync for training-operator for Kubeflow 1.9.0-rc0

Not yet. Johnu will prepare the release today.

Apr 26 '24 17:04 tenzen-y

Any updates on when we might see a new release?

Jun 14 '24 23:06 philkuz

Any updates on when we might see a new release?

You can find the new release here: https://github.com/kubeflow/training-operator/releases/tag/v1.8.0-rc.0

Jun 15 '24 09:06 tenzen-y

Training Operator 1.8 has been released 🎉 https://github.com/kubeflow/training-operator/releases/tag/v1.8.0

Thanks everyone for your contributions!

Jul 25 '24 17:07 andreyvelich
