Add Helm chart for kubeflow trainer
What this PR does / why we need it:
Close #1197
Checklist:
- [ ] Docs included if any changes are user facing
@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR? @kannon92 did we push charts to the OCI registry after this PR: https://github.com/kubernetes-sigs/jobset/pull/792 ?
@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR? @kannon92 did we push charts to the OCI registry after this PR: kubernetes-sigs/jobset#792 ?
Not yet. We are still working on that change. Once we have a release it should push to the registry.
/hold for waiting jobset helm chart to be published
/hold cancel for jobset chart v0.8.0 has been released. I have updated the PR, now one can test the Helm chart installation with the following comand:
helm upgrade kubeflow-trainer charts/kubeflow-trainer \
--install \
--dependency-update \
--namespace kubeflow-system \
--create-namespace \
--wait \
--timeout 5m0s \
--set jobset.install=true
/unhold
@ChenYi015 keep me posted once you have the updates . thank you again for working on this.
Thank you for this great contribution @ChenYi015! I left my initial comments. /assign @kubeflow/wg-training-leads @franciscojavierarceo @Electronic-Waste @astefanutti @saileshd1402 @kubeflow/wg-manifests-leads @chasecadet
This is great! @ChenYi015 please add any implementation details or really anything to https://github.com/chasecadet/KEP/tree/master/proposals/649-kubeflow-helm-support. We are working on a proposal and would love your feedback!
I think I have addressed most of the comments, PTAL when you have time. @andreyvelich @kubeflow/wg-training-leads
@andreyvelich Thanks for the review. I have created a new issue https://github.com/kubeflow/trainer/issues/2572 to track the manifests syncing script and will implement it in another PR to address the remaining comments.
Pull Request Test Coverage Report for Build 14385147230
Warning: This coverage report may be inaccurate.
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
- For more information on this, see Tracking coverage changes with pull request builds.
- To avoid this issue with future PRs, see these Recommended CI Configurations.
- For a quick fix, rebase this PR at GitHub. Your next report should be accurate.
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage remained the same at 66.475%
| Totals | |
|---|---|
| Change from base Build 14132728736: | 0.0% |
| Covered Lines: | 1735 |
| Relevant Lines: | 2610 |
💛 - Coveralls
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: saileshd1402.
Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
In response to this:
/lgtm /cc @tenzen-y @johnugeorge @saileshd1402 @Electronic-Waste @juliusvonkohout @deepanker13 @franciscojavierarceo @thesuperzapper @astefanutti
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Can we remove the kubeflow-trainer- prefix from the JobSet controller or this is a requirements that all resources created from this chart must start with kubeflow-trainer-?
I thinks this is the default behavior of installing jobset as a sub chart of trainer. I will try to find a way to remove the prefix.
For JobSet why do we create RoleBinding and ClusterRoleBinding ?
The ClusterRole/ClusterRoleBinding comes from role.yaml and role_binding.yaml and are used to grant cluster-scoped permissions.
The Role/RoleBinding comes from leader_election_role.yaml and leader_election_role_binding.yaml and are used to grant namespace-scoped permissions (e.g. leader election).
The Role/RoleBinding comes from leader_election_role.yaml and leader_election_role_binding.yaml and are used to grant namespace-scoped permissions (e.g. leader election).
Oh, I see. Maybe we should rename the postfix for this role as: -leader-election. To be consistent with manifests: https://github.com/kubernetes-sigs/jobset/blob/main/config/components/rbac/role_binding.yaml
cc @kannon92
The ClusterTrainingRuntime are not installed as part of the Charts.
We may have problems when do helm upgrading, when the webhook server are restarted and not ready to validate/mutate ClusterTrainingRuntime, it will fail the upgrading. I will try to find a better way to manage ClusterTrainingRuntime in the follow-up PRs.
Thank you for the update! Feel free to un-hold the PR if you want to address the JobSet naming changes in the followup PRs. /lgtm /approve /hold
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [andreyvelich]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Feel free to un-hold the PR if you want to address the JobSet naming changes in the followup PRs.
@andreyvelich Thank you for your review and comments. Let us improve the Helm chart step by step.
/unhold