training-operator icon indicating copy to clipboard operation
training-operator copied to clipboard

Add Helm chart for kubeflow trainer

Open ChenYi015 opened this issue 10 months ago • 12 comments

What this PR does / why we need it:

Close #1197

Checklist:

  • [ ] Docs included if any changes are user facing

ChenYi015 avatar Feb 13 '25 11:02 ChenYi015

@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR? @kannon92 did we push charts to the OCI registry after this PR: https://github.com/kubernetes-sigs/jobset/pull/792 ?

andreyvelich avatar Feb 25 '25 01:02 andreyvelich

@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR? @kannon92 did we push charts to the OCI registry after this PR: kubernetes-sigs/jobset#792 ?

Not yet. We are still working on that change. Once we have a release it should push to the registry.

kannon92 avatar Feb 25 '25 02:02 kannon92

/hold for waiting jobset helm chart to be published

ChenYi015 avatar Feb 26 '25 07:02 ChenYi015

/hold cancel for jobset chart v0.8.0 has been released. I have updated the PR, now one can test the Helm chart installation with the following comand:

helm upgrade kubeflow-trainer charts/kubeflow-trainer \
      --install \
      --dependency-update \
      --namespace kubeflow-system \
      --create-namespace \
      --wait \
      --timeout 5m0s \
      --set jobset.install=true

ChenYi015 avatar Mar 03 '25 03:03 ChenYi015

/unhold

ChenYi015 avatar Mar 03 '25 03:03 ChenYi015

@ChenYi015 keep me posted once you have the updates . thank you again for working on this.

varodrig avatar Mar 03 '25 14:03 varodrig

Thank you for this great contribution @ChenYi015! I left my initial comments. /assign @kubeflow/wg-training-leads @franciscojavierarceo @Electronic-Waste @astefanutti @saileshd1402 @kubeflow/wg-manifests-leads @chasecadet

This is great! @ChenYi015 please add any implementation details or really anything to https://github.com/chasecadet/KEP/tree/master/proposals/649-kubeflow-helm-support. We are working on a proposal and would love your feedback!

chasecadet avatar Mar 04 '25 04:03 chasecadet

I think I have addressed most of the comments, PTAL when you have time. @andreyvelich @kubeflow/wg-training-leads

ChenYi015 avatar Mar 07 '25 07:03 ChenYi015

@andreyvelich Thanks for the review. I have created a new issue https://github.com/kubeflow/trainer/issues/2572 to track the manifests syncing script and will implement it in another PR to address the remaining comments.

ChenYi015 avatar Mar 28 '25 07:03 ChenYi015

Pull Request Test Coverage Report for Build 14385147230

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 66.475%

Totals Coverage Status
Change from base Build 14132728736: 0.0%
Covered Lines: 1735
Relevant Lines: 2610

💛 - Coveralls

coveralls avatar Mar 28 '25 07:03 coveralls

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/lgtm /cc @tenzen-y @johnugeorge @saileshd1402 @Electronic-Waste @juliusvonkohout @deepanker13 @franciscojavierarceo @thesuperzapper @astefanutti

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Mar 30 '25 20:03 google-oss-prow[bot]

Can we remove the kubeflow-trainer- prefix from the JobSet controller or this is a requirements that all resources created from this chart must start with kubeflow-trainer-?

I thinks this is the default behavior of installing jobset as a sub chart of trainer. I will try to find a way to remove the prefix.

For JobSet why do we create RoleBinding and ClusterRoleBinding ?

The ClusterRole/ClusterRoleBinding comes from role.yaml and role_binding.yaml and are used to grant cluster-scoped permissions.

The Role/RoleBinding comes from leader_election_role.yaml and leader_election_role_binding.yaml and are used to grant namespace-scoped permissions (e.g. leader election).

ChenYi015 avatar Apr 10 '25 15:04 ChenYi015

The Role/RoleBinding comes from leader_election_role.yaml and leader_election_role_binding.yaml and are used to grant namespace-scoped permissions (e.g. leader election).

Oh, I see. Maybe we should rename the postfix for this role as: -leader-election. To be consistent with manifests: https://github.com/kubernetes-sigs/jobset/blob/main/config/components/rbac/role_binding.yaml

cc @kannon92

andreyvelich avatar Apr 10 '25 15:04 andreyvelich

The ClusterTrainingRuntime are not installed as part of the Charts.

We may have problems when do helm upgrading, when the webhook server are restarted and not ready to validate/mutate ClusterTrainingRuntime, it will fail the upgrading. I will try to find a better way to manage ClusterTrainingRuntime in the follow-up PRs.

ChenYi015 avatar Apr 10 '25 16:04 ChenYi015

Thank you for the update! Feel free to un-hold the PR if you want to address the JobSet naming changes in the followup PRs. /lgtm /approve /hold

andreyvelich avatar Apr 10 '25 16:04 andreyvelich

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Apr 10 '25 16:04 google-oss-prow[bot]

Feel free to un-hold the PR if you want to address the JobSet naming changes in the followup PRs.

@andreyvelich Thank you for your review and comments. Let us improve the Helm chart step by step.

/unhold

ChenYi015 avatar Apr 11 '25 02:04 ChenYi015