terraform-provider-kops

GPU instance groups apply loop

Open ddelange opened this issue 1 year ago • 1 comment

Hi 👋

We've upgraded from kops 1.23 to 1.26 (provider 1.26.0-rc1). The upgrade succeeded after some trial and error. Now, every time we run apply again, Terraform plans an in-place update for the GPU instance groups:

  # kops_instance_group.workers["ondemand-amd-32GiB-8vCPU-1GPU-eu-central-1c"] will be updated in-place
  ~ resource "kops_instance_group" "workers" {
        id                           = "domain.com/ondemand-amd-32GiB-8vCPU-1GPU-eu-central-1c"
        name                         = "ondemand-amd-32GiB-8vCPU-1GPU-eu-central-1c"
      ~ node_labels                  = {
          - "kops.k8s.io/gpu" = "1" -> null
        }
      ~ revision                     = 7 -> 8
      ~ taints                       = [
            "arch=amd64:PreferNoSchedule",
          - "nvidia.com/gpu:NoSchedule",
        ]
        # (29 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

The corresponding RuntimeClass looks OK: https://github.com/kubernetes/kops/blob/v1.26.4/upup/models/cloudup/resources/addons/nvidia.addons.k8s.io/k8s-1.16.yaml.template#L44-L59

But somehow the node labels and taints are not in sync between kops and the cluster anymore 🤔
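
A possible workaround, not a confirmed fix: assuming the label and taint in the plan above are added on the kops side for GPU instance groups rather than coming from our configuration, declaring them explicitly in the resource should make the configuration match the state, so the plan has nothing left to remove. This is a minimal sketch; the values are copied from the plan output, and the remaining arguments are elided because they stay as they are.

```hcl
# Sketch only: declare the GPU label and taint explicitly so the
# configuration matches what the plan reports as already present in state.
resource "kops_instance_group" "workers" {
  # ... existing arguments unchanged ...

  node_labels = {
    "kops.k8s.io/gpu" = "1"
  }

  taints = [
    "arch=amd64:PreferNoSchedule",
    "nvidia.com/gpu:NoSchedule",
  ]
}
```

Because the resource uses for_each, these values would have to be set only for the GPU instance groups, not for every worker group.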

ddelange, Jul 06 '23 04:07