
kops update cluster --target=terraform panics for a GCE cluster with a bastion InstanceGroup

Open flopib opened this issue 2 years ago • 3 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.26.3 (git-v1.26.3)

As far as I know, every kops version is affected.

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.24.14

3. What cloud provider are you using?

GCE

4. What commands did you run? What is the simplest way to reproduce this issue?

On an existing kops GCE cluster, create a bastion InstanceGroup as described in the kops docs, then run kops update cluster with the Terraform target:

kops create ig --name k8s.cluster bastions --role Bastion --subnet utility-europe-west4
kops update cluster k8s.cluster --target=terraform --out=terraform/
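
For reference, the InstanceGroup manifest generated by the kops create ig command above looks roughly like this (the machine type and sizes are illustrative defaults and may vary by kops version). Note the absence of a spec.zones field, which is what triggers the first panic below:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: k8s.cluster
  name: bastions
spec:
  machineType: e2-micro   # illustrative; the actual default depends on the kops version
  maxSize: 1
  minSize: 1
  role: Bastion
  subnets:
  - utility-europe-west4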

5. What happened after the commands executed?

First, with the InstanceGroup manifest generated by the kops create ig command above, kops panics with an index out of range error:

W0609 11:31:40.268371   24080 external_access.go:39] TODO: Harmonize gcemodel ExternalAccessModelBuilder with awsmodel
W0609 11:31:40.268562   24080 firewall.go:41] TODO: Harmonize gcemodel with awsmodel for firewall - GCE model is way too open
W0609 11:31:40.268749   24080 storageacl.go:165] adding bucket level write IAM for role "redacted" to gs://ControlPlane to support etcd backup
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
k8s.io/kops/pkg/model/gcemodel.(*AutoscalingGroupModelBuilder).splitToZones(0x58661c0?, 0xc000b8ac00)
	k8s.io/kops/pkg/model/gcemodel/autoscalinggroup.go:231 +0x1aa
k8s.io/kops/pkg/model/gcemodel.(*AutoscalingGroupModelBuilder).Build(0xc00069c220, 0xc0015e16c0?)
	k8s.io/kops/pkg/model/gcemodel/autoscalinggroup.go:269 +0x125
k8s.io/kops/upup/pkg/fi/cloudup.(*Loader).BuildTasks(0xc0005f9738, {0x58642b0, 0xc00012e000}, 0xc0005de4b0)
	k8s.io/kops/upup/pkg/fi/cloudup/loader.go:47 +0x124
k8s.io/kops/upup/pkg/fi/cloudup.(*ApplyClusterCmd).Run(0xc0005f9bd8, {0x58642b0, 0xc00012e000})
	k8s.io/kops/upup/pkg/fi/cloudup/apply_cluster.go:700 +0x54c5
main.RunUpdateCluster({0x58642b0, 0xc00012e000}, 0xc00055e2c0, {0x5838820, 0xc000130008}, 0xc000786790)
	k8s.io/kops/cmd/kops/update_cluster.go:293 +0xbb3
main.NewCmdUpdateCluster.func1(0xc0005c1200?, {0xc000b6be30?, 0x3?, 0x3?})
	k8s.io/kops/cmd/kops/update_cluster.go:110 +0x3a
github.com/spf13/cobra.(*Command).execute(0xc0005c1200, {0xc000b6bdd0, 0x3, 0x3})
	github.com/spf13/[email protected]/command.go:916 +0x862
github.com/spf13/cobra.(*Command).ExecuteC(0x7d1b2c0)
	github.com/spf13/[email protected]/command.go:1044 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/[email protected]/command.go:968
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/[email protected]/command.go:961
main.Execute({0x58642b0?, 0xc00012e000})
	k8s.io/kops/cmd/kops/root.go:95 +0xab
main.main()
	k8s.io/kops/cmd/kops/main.go:23 +0x27
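
The trace points at splitToZones indexing into the InstanceGroup's zone list. A minimal Go sketch of that failure mode (illustrative only, not the actual kops code): taking the first element of an empty slice panics with exactly this message, which is consistent with the generated bastion manifest carrying no zones.

package main

import "fmt"

// firstZone sketches the splitToZones-style assumption that every
// InstanceGroup has at least one zone. There is no guard against an
// empty slice.
func firstZone(zones []string) string {
	return zones[0] // panics: index out of range [0] with length 0
}

func main() {
	var zones []string // a bastion IG created without zones ends up like this
	fmt.Println(firstZone(zones))
}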

The first panic can be worked around by manually specifying a zone in the spec.zones array, for example europe-west4-b.
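
Concretely, the workaround adds a zones list to the bastion InstanceGroup spec (shown here against the illustrative manifest above; only the zones field is new):

spec:
  role: Bastion
  subnets:
  - utility-europe-west4
  zones:
  - europe-west4-b

Running the same kops update cluster command after that causes a segfault with this output: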

W0609 11:33:43.251912   24895 external_access.go:39] TODO: Harmonize gcemodel ExternalAccessModelBuilder with awsmodel
W0609 11:33:43.252209   24895 firewall.go:41] TODO: Harmonize gcemodel with awsmodel for firewall - GCE model is way too open
W0609 11:33:43.252443   24895 storageacl.go:165] adding bucket level write IAM for role "redacted" to gs://ControlPlane to support etcd backup
W0609 11:33:43.252760   24895 autoscalinggroup.go:130] enabling storage-rw for etcd backups
W0609 11:33:43.253004   24895 autoscalinggroup.go:130] enabling storage-rw for etcd backups
W0609 11:33:43.253197   24895 autoscalinggroup.go:130] enabling storage-rw for etcd backups
I0609 11:33:43.258275   24895 executor.go:111] Tasks: 0 done / 95 total; 54 can run
I0609 11:33:43.298027   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.325407   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.360259   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.390925   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.422149   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.457276   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.494929   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.531397   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.562449   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.593414   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.623793   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.654772   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.687333   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.716446   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.749615   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.779673   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.807920   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.839785   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.901680   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.932934   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:43.966634   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:44.000258   24895 storage.go:65] bucket gs://redacted has bucket-policy only; won't try to set ACLs
I0609 11:33:44.000413   24895 executor.go:111] Tasks: 54 done / 95 total; 19 can run
I0609 11:33:44.354351   24895 executor.go:111] Tasks: 73 done / 95 total; 15 can run
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x20c1c89]

goroutine 627 [running]:
k8s.io/kops/upup/pkg/fi.CopyResource({0x582fa00, 0xc000b78450}, {0x0?, 0x0?})
	k8s.io/kops/upup/pkg/fi/resources.go:85 +0x69
k8s.io/kops/upup/pkg/fi.ResourceAsString({0x0, 0x0})
	k8s.io/kops/upup/pkg/fi/resources.go:103 +0x4c
k8s.io/kops/upup/pkg/fi/cloudup/gcetasks.(*InstanceTemplate).mapToGCE(0xc00107a1e0, {0xc000980cd8, 0x13}, {0xc000a0fe00, 0xc})
	k8s.io/kops/upup/pkg/fi/cloudup/gcetasks/instancetemplate.go:347 +0xb88
k8s.io/kops/upup/pkg/fi/cloudup/gcetasks.(*InstanceTemplate).RenderTerraform(0x5?, 0xc000638f00, 0xc0008b4600?, 0xc00107a1e0, 0x2?)
	k8s.io/kops/upup/pkg/fi/cloudup/gcetasks/instancetemplate.go:607 +0x73
reflect.Value.call({0x4b1c2a0?, 0xc00107a1e0?, 0x6?}, {0x4e08ac0, 0x4}, {0xc00109ac00, 0x4, 0x5885130?})
	reflect/value.go:584 +0x8c5
reflect.Value.Call({0x4b1c2a0?, 0xc00107a1e0?, 0x4e450bb?}, {0xc00109ac00?, 0xc001348260?, 0xc0015bc060?})
	reflect/value.go:368 +0xbc
k8s.io/kops/upup/pkg/fi.(*Context[...]).Render(0xc00132e5a0, {0x5837ca0, 0x0?}, {0x5837ca0, 0xc00107a1e0?}, {0x5837ca0, 0xc001318000?})
	k8s.io/kops/upup/pkg/fi/context.go:237 +0x10f5
k8s.io/kops/upup/pkg/fi.defaultDeltaRunMethod[...]({0x5837ca0, 0xc00107a1e0?}, 0xc00132e5a0)
	k8s.io/kops/upup/pkg/fi/default_methods.go:100 +0x4e5
k8s.io/kops/upup/pkg/fi.CloudupDefaultDeltaRunMethod(...)
	k8s.io/kops/upup/pkg/fi/default_methods.go:41
k8s.io/kops/upup/pkg/fi/cloudup/gcetasks.(*InstanceTemplate).Run(0xc001328300?, 0x5837ca0?)
	k8s.io/kops/upup/pkg/fi/cloudup/gcetasks/instancetemplate.go:233 +0x2d
k8s.io/kops/upup/pkg/fi.(*executor[...]).forkJoin.func1(0x4)
	k8s.io/kops/upup/pkg/fi/executor.go:195 +0x290
created by k8s.io/kops/upup/pkg/fi.(*executor[...]).forkJoin
	k8s.io/kops/upup/pkg/fi/executor.go:183 +0xbe
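
This second trace shows mapToGCE handing a nil resource to fi.ResourceAsString, whose copy step then dereferences it. A minimal Go sketch of that failure mode (illustrative, not the kops implementation; the Resource interface and names here are simplified stand-ins): calling Open on a nil interface value produces exactly this SIGSEGV, which would fit a bastion instance template whose startup-script resource is never populated.

package main

import (
	"fmt"
	"io"
	"strings"
)

// Resource is a stand-in for kops' fi.Resource: something that can be
// opened for reading.
type Resource interface {
	Open() (io.Reader, error)
}

// resourceAsString mirrors the shape of fi.ResourceAsString: open the
// resource and copy its contents into a string. There is no nil guard,
// so r.Open() on a nil interface value panics with a nil pointer
// dereference, matching the trace above.
func resourceAsString(r Resource) (string, error) {
	rd, err := r.Open() // SIGSEGV when r is nil
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if _, err := io.Copy(&sb, rd); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	var startupScript Resource // nil, as for a resource that was never set
	fmt.Println(resourceAsString(startupScript))
}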

6. What did you expect to happen?

The Terraform code for deploying a bastion to be generated.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 1
  name: <redacted>
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    gcpPDCSIDriver:
      enabled: false
    manageStorageClasses: false
  cloudProvider: gce
  configBase: <redacted>
  containerRuntime: containerd
  containerd:
    configOverride: |
      version = 2
      [plugins]
        [plugins."io.containerd.grpc.v1.cri"]
          [plugins."io.containerd.grpc.v1.cri".containerd]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                runtime_type = "io.containerd.runc.v2"
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                  SystemdCgroup = true
  dnsZone: <redacted>
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-europe-west4-a
      name: a
    - instanceGroup: control-plane-europe-west4-b
      name: b
    - instanceGroup: control-plane-europe-west4-c
      name: c
    manager:
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:8081
      - name: ETCD_METRICS
        value: extended
      - name: ETCD_MANAGER_HOURLY_BACKUPS_RETENTION
        value: 1d
      - name: ETCD_MANAGER_DAILY_BACKUPS_RETENTION
        value: 30d
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-europe-west4-a
      name: a
    - instanceGroup: control-plane-europe-west4-b
      name: b
    - instanceGroup: control-plane-europe-west4-c
      name: c
    manager:
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:8082
      - name: ETCD_METRICS
        value: extended
      - name: ETCD_MANAGER_HOURLY_BACKUPS_RETENTION
        value: 1d
      - name: ETCD_MANAGER_DAILY_BACKUPS_RETENTION
        value: 7d
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    auditLogMaxAge: 5
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /srv/kubernetes/kube-apiserver/audit.conf
    defaultNotReadyTolerationSeconds: 150
    defaultUnreachableTolerationSeconds: 150
    disableBasicAuth: true
    enableProfiling: false
    eventTTL: 6h0m0s
    featureGates:
      EphemeralContainers: "true"
    logFormat: json
  kubeControllerManager:
    configureCloudRoutes: true
    featureGates:
      EphemeralContainers: "true"
      InTreePluginGCEUnregister: "true"
    horizontalPodAutoscalerDownscaleDelay: 3m0s
    horizontalPodAutoscalerSyncPeriod: 15s
    horizontalPodAutoscalerUpscaleDelay: 3m0s
    logFormat: json
  kubeDNS:
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubeScheduler:
    logFormat: json
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    cgroupDriver: systemd
    featureGates:
      EphemeralContainers: "true"
      InTreePluginGCEUnregister: "true"
    logFormat: json
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.24.14
  masterPublicName: <redacted>
  networking:
    canal: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  ntp:
    managed: false
  project: <redacted>
  sshAccess:
  - <redacted>
  subnets:
  - cidr: 10.0.16.0/20
    egress: External
    name: cluster-europe-west4
    region: europe-west4
    type: Private
  - cidr: 10.0.32.0/20
    egress: External
    name: utility-europe-west4
    region: europe-west4
    type: Utility
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

flopib avatar Jun 09 '23 09:06 flopib

Hi @tesspib. Could you check this again when kOps 1.27.0 is released? There are many GCE-related improvements there.

hakman avatar Jul 17 '23 02:07 hakman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 24 '24 15:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 23 '24 15:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 24 '24 15:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 24 '24 15:03 k8s-ci-robot