cluster-api-provider-gcp

Add GPU/Accelerator support to VMs in GCPMachineTemplate

Open jwmay2012 opened this issue 1 year ago • 14 comments

What type of PR is this?

/kind feature

What this PR does / why we need it: Adds the ability to configure guest accelerators, such as GPUs, in a GCPMachineTemplate.

Fixes #289

Special notes for your reviewer: Tested; machines with GPUs are created correctly. After installing the drivers and the NVIDIA container runtime on the node, I was able to run a GPU workload successfully in a Pod. If you request an accelerator on an instance type that does not support it, GCP returns an instance reconcile error that describes the improper API use.

OnHostMaintenance must be set to TERMINATE for GPU-enabled machines (see https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance#guest_accelerator). I confirmed this is correct: GCP rejects the instance reconcile otherwise. I set this field automatically.
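
The automatic defaulting described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `GuestAccelerator` and `InstanceSpec` types, their field names, and the `applyAcceleratorDefaults` helper are all hypothetical stand-ins, and the instance and accelerator type strings are example values only.

```go
package main

import "fmt"

// GuestAccelerator is a hypothetical stand-in for one accelerator entry
// on a machine spec; it is not the type introduced by this PR.
type GuestAccelerator struct {
	Type  string // example value: "nvidia-tesla-t4"
	Count int64
}

// InstanceSpec is a hypothetical, trimmed-down machine spec.
type InstanceSpec struct {
	InstanceType      string
	OnHostMaintenance string // "MIGRATE" or "TERMINATE"; empty means unset
	GuestAccelerators []GuestAccelerator
}

// applyAcceleratorDefaults forces OnHostMaintenance to TERMINATE whenever
// at least one accelerator is requested, because GCP rejects GPU-enabled
// instances with any other maintenance policy. Specs without accelerators
// are left untouched.
func applyAcceleratorDefaults(spec *InstanceSpec) {
	if len(spec.GuestAccelerators) > 0 {
		spec.OnHostMaintenance = "TERMINATE"
	}
}

func main() {
	spec := InstanceSpec{
		InstanceType:      "n1-standard-4",
		OnHostMaintenance: "MIGRATE",
		GuestAccelerators: []GuestAccelerator{{Type: "nvidia-tesla-t4", Count: 1}},
	}
	applyAcceleratorDefaults(&spec)
	fmt.Println(spec.OnHostMaintenance) // prints "TERMINATE"
}
```

Overriding the value silently, rather than returning a validation error, mirrors the "I set this field automatically" behavior described above; a stricter design could reject an explicit MIGRATE instead.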

TODOs:

  • [x] squashed commits
  • [x] includes documentation
  • [ ] adds unit tests

Release note:

Add GPU/Accelerator support for VMs in GCPMachineTemplate

jwmay2012 avatar Oct 22 '24 17:10 jwmay2012

Hi @jwmay2012. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Oct 22 '24 17:10 k8s-ci-robot

Deploy Preview for kubernetes-sigs-cluster-api-gcp ready!

Latest commit: aed6b4c3656aeb4c5a95585fcb4dfefa56e123ea
Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-cluster-api-gcp/deploys/68d2c82892b2350008155960
Deploy Preview: https://deploy-preview-1341--kubernetes-sigs-cluster-api-gcp.netlify.app

netlify[bot] avatar Oct 22 '24 17:10 netlify[bot]

Thanks @jwmay2012

/ok-to-test

salasberryfin avatar Oct 22 '24 19:10 salasberryfin

@jwmay2012 - would you be able to run make lint on this change?

richardcase avatar Oct 23 '24 12:10 richardcase

Could you please provide an estimate of when this change might be included in a release?

reyvonger avatar Dec 11 '24 18:12 reyvonger

Thanks @jwmay2012.

/lgtm

salasberryfin avatar Jan 02 '25 09:01 salasberryfin

We good to merge? Been running a custom CAPG with these changes for a while and would love to get this upstream :)

jwmay2012 avatar Jan 09 '25 19:01 jwmay2012

@richardcase are you happy with this? If so would you be able to stamp your approval on it? Thanks!

damdo avatar Jan 27 '25 14:01 damdo

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cpanato, jwmay2012

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jan 27 '25 14:01 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 27 '25 15:04 k8s-triage-robot

bump

reyvonger avatar Apr 27 '25 19:04 reyvonger

/cc @elmiko

damdo avatar May 21 '25 07:05 damdo

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 26 '25 16:06 k8s-triage-robot

bump

reyvonger avatar Jun 26 '25 16:06 reyvonger

/remove-lifecycle rotten

elmiko avatar Jun 30 '25 19:06 elmiko

@dims @richardcase help

reyvonger avatar Jul 15 '25 14:07 reyvonger

@jwmay2012 could you please rebase? Thanks!

damdo avatar Jul 15 '25 15:07 damdo

Or @reyvonger

damdo avatar Jul 15 '25 15:07 damdo