
Use cncf-hosted gha runners

Open jeefy opened this issue 9 months ago • 7 comments

Description

CNCF has hosted ephemeral GitHub runners in Oracle that we'd like projects to use rather than the GitHub-hosted ones, which now incur a cost to use. ~This PR is currently a WIP to work through any tests that break or dependencies that may be missing.~ <3

Please direct any questions to myself, @krook and @RobertKielty

jeefy avatar Mar 17 '25 17:03 jeefy

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Mar 17 '25 17:03 google-oss-prow[bot]

Yup! However there is a difference between the GitHub hosted and the CNCF hosted runners.

Our runners are running in containers. The GitHub hosted ones run in VMs. There can be dependency drift and unforeseen issues we'll have to work through. Should be fun though 🙂

On Mon, Mar 17, 2025, 14:37 Andrey Velichkevich commented on this pull request (https://github.com/kubeflow/trainer/pull/2538):

> This is great that we have access to the Oracle runners 🎉 Thank you for this @jeefy! Should we update this issue and update other repos (e.g. mpi-operator): kubeflow/community#829 (https://github.com/kubeflow/community/issues/829)? cc @kubeflow/wg-training-leads @kubeflow/kubeflow-steering-committee

jeefy avatar Mar 17 '25 22:03 jeefy

If we move all CI jobs to CNCF-hosted runners, we need to create DinD container images so that we can run a Kind cluster during CI.

tenzen-y avatar Mar 18 '25 08:03 tenzen-y

> If we move all CI jobs to CNCF-hosted runners, we need to create DinD container images so that we can run a Kind cluster during CI.

If I understand this correctly, it is only for the more expensive larger GHA runners, not the default ones provided by GitHub. So we only need to migrate workflows that benefit from larger runners or VMs.

juliusvonkohout avatar Mar 18 '25 10:03 juliusvonkohout

> > If we move all CI jobs to CNCF-hosted runners, we need to create DinD container images so that we can run a Kind cluster during CI.
>
> If I understand this correctly, it is only for the more expensive larger GHA runners, not the default ones provided by GitHub. So we only need to migrate workflows that benefit from larger runners or VMs.

Yes, your understanding is correct. I mostly meant the Trainer E2E tests.

tenzen-y avatar Mar 18 '25 10:03 tenzen-y

> If we move all CI jobs to CNCF-hosted runners, we need to create DinD container images so that we can run a Kind cluster during CI.

DinD is already baked into the current setup. You can do Docker builds (and some other jobs already do).

We still need to debug why your e2e/kind cluster didn't spin up, though.

jeefy avatar Mar 18 '25 14:03 jeefy
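One way to surface the dependency drift mentioned above is a small preflight step that runs before the E2E job. The following is only a sketch under stated assumptions: the tool list (`docker`, `kind`, `kubectl`) and the helper names are hypothetical and not taken from this PR, and the container check is a heuristic (Docker creates `/.dockerenv` in containers it starts, and Kubernetes sets `KUBERNETES_SERVICE_HOST` in pods), so it may miss other runtimes.

```python
import os
import shutil
from pathlib import Path

# Assumption: tools a Kind-based E2E job would need; adjust for the real workflow.
REQUIRED_TOOLS = ("docker", "kind", "kubectl")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def looks_like_container(root=Path("/"), env=os.environ):
    """Heuristic guess at whether the job runs inside a container.

    Checks for the /.dockerenv marker file or the Kubernetes service
    environment variable. A False result does not prove a VM.
    """
    return (root / ".dockerenv").exists() or "KUBERNETES_SERVICE_HOST" in env


def preflight_report():
    """Build a one-line summary suitable for printing in a CI log."""
    absent = missing_tools()
    status = "none" if not absent else ", ".join(absent)
    return f"container={looks_like_container()} missing_tools={status}"
```

Printing `preflight_report()` as an early workflow step would show whether a missing binary, rather than the runner itself, explains a failing Kind cluster.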

> > If we move all CI jobs to CNCF-hosted runners, we need to create DinD container images so that we can run a Kind cluster during CI.
>
> DinD is already baked into the current setup. You can do Docker builds (and some other jobs already do).
>
> We still need to debug why your e2e/kind cluster didn't spin up, though.

Oh, I didn't know that. Thank you for letting us know.

tenzen-y avatar Mar 18 '25 14:03 tenzen-y

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 16 '25 15:06 github-actions[bot]

This pull request has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Jul 06 '25 20:07 github-actions[bot]