training-operator Improve Training Operator release process

Related: https://github.com/kubeflow/katib/issues/2049

We need to improve our release process for Training Operator:

Branch names should follow this pattern: release-X.Y. Similar to Katib or Kubernetes.
Automate release with GitHub Actions.
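
As a rough illustration of the branch naming rule, a release workflow could refuse to run on branches that do not match `release-X.Y`. The sketch below is only an assumption about how such a check might look; the function name `is_release_branch` and the exact regular expression are hypothetical and not taken from the Training Operator repository.

```python
import re

# Assumed pattern: "release-" followed by MAJOR.MINOR without leading zeros.
RELEASE_BRANCH_RE = re.compile(r"release-(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_release_branch(name: str) -> bool:
    """Return True if `name` follows the release-X.Y convention."""
    return RELEASE_BRANCH_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for branch in ["release-1.8", "release-2.0", "v1.8-branch", "release-2"]:
        print(branch, is_release_branch(branch))
```

A CI job could call this with the current branch name and exit early when it returns `False`.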

/good-first-issue /help

Jun 25 '24 19:06 andreyvelich

@andreyvelich: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jun 25 '24 19:06 google-oss-prow[bot]

I want to take this. /assign

Jun 26 '24 07:06 7h3-3mp7y-m4n

Additionally, I would like to use semantic versioning image tags for every release here: https://github.com/kubeflow/training-operator/blob/f8687ca7fd947e6ebd52dde4dfeefdf006e7b239/manifests/overlays/standalone/kustomization.yaml#L9

Jul 25 '24 07:07 tenzen-y

okay I'll look at it and raise a PR ASAP

Jul 29 '24 11:07 7h3-3mp7y-m4n

No one is working on this one right? I can take a look /assign

Oct 31 '24 12:10 Deathfireofdoom

No one is working on this one right? I can take a look /assign

Yes, feel free to take this.

Nov 01 '24 17:11 tenzen-y

Thank you for your time @Deathfireofdoom! I would also suggest checking how we refactored and automated the Spark Operator release process with @ChenYi015: https://github.com/kubeflow/spark-operator/pull/2089

I think we can reuse some of the steps.

Nov 01 '24 17:11 andreyvelich

Hey @Deathfireofdoom if you're not working on this issue, then I would like to work on it.

Dec 14 '24 10:12 Veer0x1

@Veer0x1 Sorry for the delay, started tackling it but got hectic at work, chapter 11 stuff hahah, so will probably not have time to look into this more until after holiday anyway! So feel free to take it! :)

Dec 18 '24 17:12 Deathfireofdoom

/assign

Dec 21 '24 05:12 Veer0x1

@andreyvelich Any suggestion on how to handle changelog generation?

Mar 15 '25 20:03 milinddethe15

@Veer0x1 Please let us know if you still want to work on this. Given that we are very close to making the first Kubeflow Trainer releases, it would be great to automate our process!

/unassign @Veer0x1

@milinddethe15 Could you help us explore how others solve it? Do we need to introduce a PR title check to simplify Changelog generation:

feat(...)
fix(...)
chore(...)

I think we can use the same action that @thesuperzapper used for Kubeflow Notebooks, and just update the types and scopes.
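
To make the proposed title convention concrete, here is a minimal sketch of a PR title check. It is not the action used by Kubeflow Notebooks; the function name `check_pr_title`, the allowed type list, and the requirement of a colon followed by a description are assumptions for illustration.

```python
import re

# Assumed allowed types; the thread only names feat, fix, and chore.
ALLOWED_TYPES = ("feat", "fix", "chore")

# Matches titles such as "feat(scope): description": a lowercase type,
# a non-empty scope in parentheses, a colon, and a non-empty description.
TITLE_RE = re.compile(r"^(?P<type>[a-z]+)\((?P<scope>[^()]+)\): (?P<subject>\S.*)$")


def check_pr_title(title: str) -> tuple[bool, str]:
    """Return (is_valid, message) for a pull request title."""
    match = TITLE_RE.match(title)
    if match is None:
        return False, "title must look like 'type(scope): description'"
    if match["type"] not in ALLOWED_TYPES:
        return False, f"unknown type '{match['type']}'"
    return True, "ok"


if __name__ == "__main__":
    # The scope "operator" is a made-up example value.
    print(check_pr_title("feat(operator): add release workflow"))
    print(check_pr_title("Add release workflow"))
```

The set of types and scopes would be whatever the maintainers agree on; only the overall `type(scope)` shape comes from this discussion.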

Do you want to work on this @milinddethe15 ?

Mar 18 '25 18:03 andreyvelich

Yes, I am happy to help with this. I will look at how other KF projects are doing this and give an update here before opening a PR.

/assign

Mar 18 '25 20:03 milinddethe15

Hi @milinddethe15, did you get a chance to work on this issue ?

We are planning to release Kubeflow Trainer 2.0 soon, and it would be nice to have release automation for it: https://github.com/kubeflow/trainer/issues/2170

Apr 11 '25 23:04 andreyvelich

I am working on it.

Apr 14 '25 10:04 milinddethe15

@andreyvelich I have pushed the commits and created a PR (https://github.com/milinddethe15/kf-trainer/pull/1) in my fork for testing. However, the GitHub Actions jobs using ubuntu-latest-16-cores aren't getting started. Is there any workaround to test the release process?

Apr 14 '25 15:04 milinddethe15

@milinddethe15 Do you want to try the default runner: ubuntu-latest to try out your release action ? Also, FYI, we don't need to release SDK as part of Kubeflow Trainer release since it will be decoupled from kubeflow/trainer after this KEP: https://github.com/kubeflow/community/pull/823.

Apr 14 '25 22:04 andreyvelich

@milinddethe15 Do you want to try the default runner: ubuntu-latest to try out your release action ?

yeah, I will try that out.

Also, FYI, we don't need to release SDK as part of Kubeflow Trainer release since it will be decoupled from kubeflow/trainer after this KEP: kubeflow/community#823.

Will this delay the Trainer v2.0 release until the NEW SDK is available?

Apr 15 '25 05:04 milinddethe15

@milinddethe15 Do you want to try the default runner: ubuntu-latest to try out your release action ?

yeah, I will try that out.

I have used the ubuntu-latest runners but the e2e tests are failing due to: no space left on device

Apr 15 '25 10:04 milinddethe15

Will this delay the Trainer v2.0 release until the NEW SDK is available?

No, we don't need to delay Trainer v2.0. For now, we just ask users to directly install SDK from the kubeflow/sdk repository.

I have used the ubuntu-latest runners but the e2e tests are failing due to: no space left on device

Can you try to test it without building the images? Maybe you can just "fake" the image build to verify that the rest of the steps are working correctly?

Apr 15 '25 11:04 andreyvelich

Hi @milinddethe15, do you think we can target this enhancement before the Kubeflow Trainer 2.0 release? We are planning to cut the release before May 5th.

Apr 23 '25 02:04 andreyvelich

I have successfully set up the release actions; see my forked release branch: https://github.com/milinddethe15/kf-trainer/tree/release-2.0. Automating the changelog generation in the draft release is still pending. We can use: https://github.com/kubeflow/trainer/blob/master/docs/release/changelog.py. However, grouping PRs into Breaking Changes, New Features, Bug Fixes, Misc, etc. will be a manual task. So, can this step be skipped (I mean, the CHANGELOG would be updated manually)?

Apr 23 '25 11:04 milinddethe15

That's great, yes, I think we can skip the Changelog generation for now.

For the Changelog, shall we apply the PR name validation to ask contributors to name PRs as follows: feat(...) chore(...) fix(...)

Similar to KFP and Kubeflow Notebooks ?
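
For context on why title validation helps, the sketch below shows how prefixed PR titles could be grouped into changelog sections automatically. It is not the existing `changelog.py` script; the type-to-section mapping and the use of a trailing `!` to mark breaking changes (as in Conventional Commits) are assumptions that the project would have to agree on.

```python
import re
from collections import defaultdict

# Assumed mapping from PR title type to changelog section. The section
# names come from this thread; the mapping itself is illustrative.
SECTIONS = {"feat": "New Features", "fix": "Bug Fixes"}
DEFAULT_SECTION = "Misc"
BREAKING_SECTION = "Breaking Changes"

# type, optional (scope), optional "!" marking a breaking change, then ": ".
TITLE_RE = re.compile(r"^(?P<type>[a-z]+)(?:\([^()]*\))?(?P<breaking>!)?: ")


def group_titles(titles):
    """Group PR titles by changelog section, keeping the input order."""
    groups = defaultdict(list)
    for title in titles:
        match = TITLE_RE.match(title)
        if match is None:
            section = DEFAULT_SECTION  # titles without a recognised prefix
        elif match["breaking"]:
            section = BREAKING_SECTION
        else:
            section = SECTIONS.get(match["type"], DEFAULT_SECTION)
        groups[section].append(title)
    return dict(groups)


if __name__ == "__main__":
    # Made-up titles for demonstration only.
    demo = ["feat(api): add X", "fix(ci): Y", "chore(deps): bump", "Update README"]
    for section, items in group_titles(demo).items():
        print(section, items)
```

Titles that do not follow the convention fall into the Misc section here, which is one reason to enforce the format at PR time.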

WDYT @kubeflow/wg-training-leads @Electronic-Waste @astefanutti ?

Apr 23 '25 16:04 andreyvelich

@milinddethe15 Also, why are the images not updated in the Kustomize manifests in your branch? E.g. we should keep this image tag: v2.0.0 https://github.com/milinddethe15/kf-trainer/blob/release-2.0/manifests/overlays/manager/kustomization.yaml#L17

Apr 23 '25 16:04 andreyvelich

For the Changelog, shall we apply the PR name validation to ask contributors to name PRs as follows: feat(...) chore(...) fix(...)

Similar to KFP and Kubeflow Notebooks ?

SGTM

Apr 23 '25 17:04 Electronic-Waste

@milinddethe15 Also, why are the images not updated in the Kustomize manifests in your branch? E.g. we should keep this image tag: v2.0.0 https://github.com/milinddethe15/kf-trainer/blob/release-2.0/manifests/overlays/manager/kustomization.yaml#L17

I am just testing the release actions here. Although we should check whether the image tags match the VERSION in the Check Release action.
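
One possible shape for such a check, as a sketch only: scan a `kustomization.yaml` for `newTag:` entries and report any that differ from the expected version. The function name `mismatched_tags` and the image names in the demo are hypothetical, and the regular expression is a simplification that assumes each tag sits on a single `newTag:` line.

```python
import re

# Matches lines such as `newTag: v2.0.0` or `newTag: "v2.0.0"`,
# optionally followed by a trailing comment.
NEW_TAG_RE = re.compile(
    r"""^[ \t]*newTag:[ \t]*["']?([^"'\s#]+)["']?[ \t]*(?:#.*)?$""",
    re.MULTILINE,
)


def mismatched_tags(kustomization_text: str, version: str) -> list[str]:
    """Return every newTag value that differs from the expected version."""
    return [tag for tag in NEW_TAG_RE.findall(kustomization_text) if tag != version]


if __name__ == "__main__":
    # Hypothetical manifest content for demonstration.
    demo = (
        "images:\n"
        "  - name: example.com/manager\n"
        "    newTag: latest\n"
        "  - name: example.com/initializer\n"
        '    newTag: "v2.0.0"\n'
    )
    print(mismatched_tags(demo, "v2.0.0"))
```

A release check could fail when the returned list is non-empty.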

Apr 24 '25 13:04 milinddethe15

@milinddethe15 Should we update the image tag as part of the release action? For example, we do that in the Katib release script: https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/release.sh#L71
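
As a sketch of what updating the tag during a release might involve (this does not reproduce the Katib script), the function below rewrites every `newTag:` value in a Kustomize manifest to the release version. The name `set_image_tags` is hypothetical, and the approach assumes plain `newTag: value` lines; lines with trailing comments are left untouched in this simplified version.

```python
import re

# Captures the "newTag:" prefix (with indentation) so only the value changes.
NEW_TAG_LINE_RE = re.compile(r"^([ \t]*newTag:[ \t]*)\S+[ \t]*$", re.MULTILINE)


def set_image_tags(kustomization_text: str, version: str) -> str:
    """Replace the value of every newTag line with `version`."""
    # A function replacement avoids any backreference parsing of `version`.
    return NEW_TAG_LINE_RE.sub(lambda m: m.group(1) + version, kustomization_text)


if __name__ == "__main__":
    # Hypothetical manifest content for demonstration.
    before = "images:\n  - name: example.com/manager\n    newTag: latest\n"
    print(set_image_tags(before, "v2.0.0"))
```

Running the function twice with the same version yields the same text, so it is safe to re-run in a retried workflow.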

Apr 24 '25 14:04 andreyvelich

Yes, we can do that.

Apr 24 '25 18:04 milinddethe15

@milinddethe15 @Veer0x1 Do you want to finalize your PRs to automate the Kubeflow Trainer release process, or should we find a new contributor for it?

https://github.com/kubeflow/trainer/pull/2359
https://github.com/kubeflow/trainer/pull/2623

May 28 '25 19:05 andreyvelich

/area engprod

May 28 '25 22:05 andreyvelich