
Enable GPU Testing for LLM Blueprints

Open andreyvelich opened this issue 10 months ago • 8 comments

What you would like to be added?

We will soon introduce LLM Blueprints which typically require GPUs to run them: https://github.com/kubeflow/trainer/pull/2410. To support this, we need to explore using GitHub self-hosted runners with GPU support.

If we can get them, we need to see if we can deploy Kubernetes cluster with these runners and NVIDIA GPU Operator.
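
For reference, installing the NVIDIA GPU Operator on an existing cluster is typically a single Helm release. A sketch (release name and namespace are our choice here, not something decided in this issue):

```shell
# Add NVIDIA's Helm repository and install the GPU Operator, which deploys
# the driver, container toolkit, device plugin, and monitoring components.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --wait

# Verify that the nodes now advertise GPU resources to the scheduler.
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```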

cc @Electronic-Waste @astefanutti @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @franciscojavierarceo

Why is this needed?

We need GPUs for our testing infrastructure.

Love this feature?

Give it a 👍 We prioritize the features with the most 👍

andreyvelich avatar Feb 11 '25 18:02 andreyvelich

@andreyvelich I did some brief research, and it seems that kind does not support GPUs natively. However, there are custom configurations to enable it; one is a maintained fork from NVIDIA itself:

  1. https://github.com/NVIDIA/nvkind

otherwise, minikube has native support:

  1. https://minikube.sigs.k8s.io/docs/tutorials/nvidia/
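
From the tutorial above, the Docker-driver path boils down to a couple of commands. A sketch, assuming the NVIDIA Container Toolkit is already installed on the host:

```shell
# Start a Minikube cluster with GPU passthrough via the Docker driver.
minikube start --driver docker --container-runtime docker --gpus all

# Check that the node exposes nvidia.com/gpu as an allocatable resource.
kubectl get node minikube -o jsonpath='{.status.allocatable}'
```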

mahdikhashan avatar Feb 19 '25 18:02 mahdikhashan

It is good news that Minikube supports the NVIDIA device plugin!

I remember that @astefanutti and @franciscojavierarceo also explored how to leverage Nvidia with local k8s cluster: https://github.com/kubeflow/sdk/issues/22

andreyvelich avatar Feb 20 '25 13:02 andreyvelich

/area gsoc

Electronic-Waste avatar Feb 28 '25 09:02 Electronic-Waste

Hi Andrey,

Thank you for bringing this up! The introduction of LLM Blueprints is an exciting step, and having GPU-supported testing infrastructure will definitely ensure we can fully leverage their potential.

I agree that exploring GitHub self-hosted runners with GPU support is a critical priority. Deploying a Kubernetes cluster with these runners and integrating the NVIDIA GPU Operator sounds like a great plan for resource management and scalability. I’d be happy to assist with this effort or provide any input needed.

Looking forward to collaborating with everyone on making this a reality!

SimardeepSingh-zsh avatar Mar 15 '25 04:03 SimardeepSingh-zsh

Hey @andreyvelich ,

This is an interesting idea; I just have a question about the specifics of this issue.

Is the scope of this project to create documentation that details how to set up and tear down a GPU-enabled Kubernetes cluster running on GitHub self-hosted runners for testing and also to add templates/examples in the Kubeflow repo so that users can easily replicate this setup themselves?

nb923 avatar Mar 21 '25 00:03 nb923

@andreyvelich Is there any preference for what we should use to create a local k8s cluster with GPU support? Or could you elaborate on what we should consider when selecting the tech stack for such a cluster?

izuku-sds avatar Mar 26 '25 00:03 izuku-sds

Thanks everyone for your interest in this work!

Is the scope of this project to create documentation that details how to set up and tear down a GPU-enabled Kubernetes cluster running on GitHub self-hosted runners for testing and also to add templates/examples in the Kubeflow repo so that users can easily replicate this setup themselves?

The scope of this project would be to configure our testing infrastructure to re-use GPU infrastructure that Oracle can give us.

That can be one of two ways:

  1. Re-use GitHub self-hosted runners that have GPUs enabled. That will require figuring out how to deploy a Kind/Minikube cluster with the NVIDIA driver, so we can utilize the GPU.
  2. Connect to the existing Kubernetes cluster (OKE) and run our tests there.

It would be nice if you could explain both approaches in the proposal and what we need to do to achieve them.

We are still discussing with @jaiakash and the team what the preferable solution would be.

Is there any preference for what we should use to create a local k8s cluster with GPU support? Or could you elaborate on what we should consider when selecting the tech stack for such a cluster?

Any cluster that can support the NVIDIA GPU driver would work. If we move forward with self-hosted runners, using Kind would be preferable, since we already run it for our E2Es.
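
If we go the Kind route, nvkind (linked above) wraps kind so that cluster nodes can see the host GPUs. Roughly, as a sketch (the cluster name is hypothetical, and this assumes nvkind and the NVIDIA Container Toolkit are installed on the runner):

```shell
# Create a Kind cluster whose nodes have access to the host GPUs.
nvkind cluster create --name gpu-e2e

# The NVIDIA device plugin (or the full GPU Operator) is still needed
# inside the cluster to advertise nvidia.com/gpu to the scheduler.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace
```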

andreyvelich avatar Mar 30 '25 22:03 andreyvelich

Hi, so I tried running the https://github.com/kubeflow/trainer project's e2e tests on my personal machine. Link - branch-self-runner-akash


test-runner is my machine where I executed the self-hosted runner


  • ARC - GitHub also supports the Actions Runner Controller (ARC). Our requirements align with the same approach. My suggestion is to use ARC so we can replicate this solution on OKE as well as the current GPU infra.

  • Label Runner - For the existing infra, the plan is to mark GPU infra with special labels, so AI tasks use only those resources while other tasks can use generic resources.

  • Security - The official GitHub documentation itself says self-hosted runners are not very secure on public repos, so there are two options: 1. have members manually trigger workflows for PRs, and/or 2. trigger intensive workflows only on the main branch.
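
The label-gating idea above maps directly onto `runs-on` in a workflow. A hypothetical fragment (the `gpu` label name and `make test-e2e` target are illustrative; `workflow_dispatch` addresses the security point by requiring a manual trigger):

```yaml
# Hypothetical GPU E2E job: runs on manual dispatch or pushes to main,
# and only on self-hosted runners carrying the "gpu" label.
name: gpu-e2e
on:
  workflow_dispatch: {}
  push:
    branches: [main]

jobs:
  e2e-gpu:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run GPU E2E tests
        run: make test-e2e   # hypothetical target
```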

Ask -

  • Any sample LLM blueprint to test? I have tested the CI and E2E actions of a few projects; they are connecting and executing. There is a permission issue since it's my local machine, but that can be fixed easily if we use the cloud.
  • I will be using my personal machine as the cluster. It has good specs but a basic GPU (NVIDIA GTX 1650 Ti), which I think is fine for testing the connection and flow! Later, if possible, I would like access to a cloud machine with the appropriate requirements.

jaiakash avatar Mar 31 '25 13:03 jaiakash

@andreyvelich @jaiakash can we confirm that Oracle will be providing the GPUs to run this infrastructure? I'm also interested to see how this will work in terms of access and resource allocation.

varodrig avatar Apr 20 '25 22:04 varodrig

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 07 '25 20:09 github-actions[bot]

This has been resolved by: https://github.com/kubeflow/trainer/pull/2689 /close

andreyvelich avatar Sep 07 '25 22:09 andreyvelich

@andreyvelich: Closing this issue.

In response to this:

This has been resolved by: https://github.com/kubeflow/trainer/pull/2689 /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Sep 07 '25 22:09 google-oss-prow[bot]