
Rolling update panic with warm pools

Open yurrriq opened this issue 3 years ago • 3 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

Version 1.23.2 (git-1.23.2)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

kubernetesVersion: 1.22.10

3. What cloud provider are you using?

cloudProvider: aws

4. What commands did you run? What is the simplest way to reproduce this issue?

kops rolling-update cluster --yes. We were also upgrading from kOps 1.22.4 to 1.23.2, in case that's relevant.

5. What happened after the commands executed?

Panic.

6. What did you expect to happen?

Smooth rolling update.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
# ...
spec:
  # ...
  cloudProvider: aws
  kubernetesVersion: 1.22.10 # was 1.22.7, in case that's relevant
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
# ...
spec:
  # ...
  maxSize: 30
  minSize: 1
  role: Node
  rollingUpdate:
    drainAndTerminate: true
    maxSurge: 60%
  warmPool:
    enableLifecycleHook: true
    maxSize: 4
    minSize: 3

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

I0609 14:19:12.802210     485 instancegroups.go:309] Tainting 2 nodes in "nodes-eu-central-1a" instancegroup.
I0609 14:19:12.837133     485 instancegroups.go:131] deleting warm pool instance "AAAA"
I0609 14:19:13.004328     485 instancegroups.go:131] deleting warm pool instance "BBBB"
I0609 14:19:13.161806     485 instancegroups.go:131] deleting warm pool instance "CCCC"
I0609 14:19:13.309859     485 instancegroups.go:565] Detaching instance "DDDD", ...
I0609 14:19:13.564841     485 instancegroups.go:170] waiting for 15s after detaching instance
I0609 14:19:28.565202     485 instancegroups.go:467] Validating the cluster.
I0609 14:19:30.949595     485 instancegroups.go:503] Cluster validated; revalidating in 10s to make sure it does not flap.
I0609 14:19:44.024648     485 instancegroups.go:500] Cluster validated.
I0609 14:19:44.024687     485 instancegroups.go:565] Detaching instance "EEEE", ...
panic: runtime error: index out of range [-1]
goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc0005ba900, 0xc0007de000, 0x37e11d600)
	k8s.io/kops/pkg/instancegroups/instancegroups.go:154 +0x1031
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc0005ba900, 0xc0007d5590, 0xc0008aeb40)
	k8s.io/kops/pkg/instancegroups/rollingupdate.go:196 +0xa45
main.RunRollingUpdateCluster({0x7439c80, 0xc00005c1c8}, 0xc0008bfd98, {0x7343860, 0xc000010018}, 0xc00071a510)
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:434 +0xf45
main.NewCmdRollingUpdateCluster.func1(0xc000765900, {0xc000088800, 0x2, 0x2})
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:176 +0x37
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).execute(0xc000765900, {0xc0000887e0, 0x2, 0x2})
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:856 +0x60e
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x9ece900)
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:902
main.Execute()
	k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
	k8s.io/kops/cmd/kops/main.go:20 +0x17

9. Anything else we need to know?

Similar to #11774

yurrriq, Jun 09 '22 16:06

We're still seeing this quite frequently.

yurrriq, Aug 10 '22 19:08

Please try to reproduce this with kops 1.24.1. It should give more information about the detach failure that may be causing this.

olemarkus, Aug 10 '22 19:08

Noted. We're currently still using 1.23.2.

yurrriq, Aug 10 '22 19:08

Still seeing this with kOps 1.24.1. Here are some logs:

I0908 09:43:08.745335    6639 instancegroups.go:311] Tainting 6 nodes in "nodes-eu-central-1c" instancegroup.
I0908 09:43:08.848436    6639 instancegroups.go:132] deleting warm pool instance "AAAAA"
I0908 09:43:09.003241    6639 instancegroups.go:132] deleting warm pool instance "BBBBB"
I0908 09:43:09.141794    6639 instancegroups.go:132] deleting warm pool instance "CCCCC"
I0908 09:43:09.293505    6639 instancegroups.go:132] deleting warm pool instance "DDDDD"
I0908 09:43:09.463532    6639 instancegroups.go:132] deleting warm pool instance "EEEEE"
I0908 09:43:09.632396    6639 instancegroups.go:132] deleting warm pool instance "FFFFF"
I0908 09:43:09.791094    6639 instancegroups.go:132] deleting warm pool instance "GGGGG"
I0908 09:43:09.962070    6639 instancegroups.go:132] deleting warm pool instance "HHHHH"
I0908 09:43:10.099380    6639 instancegroups.go:132] deleting warm pool instance "IIIII"
I0908 09:43:10.241848    6639 instancegroups.go:132] deleting warm pool instance "JJJJJ"
I0908 09:43:10.385283    6639 instancegroups.go:568] Detaching instance "KKKKK", node "ip-KKKKK.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:10.739226    6639 instancegroups.go:172] waiting for 15s after detaching instance
I0908 09:43:25.739428    6639 instancegroups.go:470] Validating the cluster.
I0908 09:43:28.186852    6639 instancegroups.go:506] Cluster validated; revalidating in 10s to make sure it does not flap.
I0908 09:43:40.244820    6639 instancegroups.go:503] Cluster validated.
I0908 09:43:40.244869    6639 instancegroups.go:568] Detaching instance "LLLLL", node "ip-LLLLL.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:40.678965    6639 instancegroups.go:568] Detaching instance "MMMMM", node "ip-MMMMM.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.120884    6639 instancegroups.go:568] Detaching instance "NNNNN", node "ip-NNNNN.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.547119    6639 instancegroups.go:568] Detaching instance "OOOOO", node "ip-OOOOO.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.957413    6639 instancegroups.go:568] Detaching instance "PPPPP", node "ip-PPPPP.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
panic: runtime error: index out of range [-1]

goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc000452240, 0xc0002c01c0, 0x37e11d600)
	k8s.io/kops/pkg/instancegroups/instancegroups.go:155 +0x1410
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc000452240, 0xc00087cbd0, 0xc000a53560?)
	k8s.io/kops/pkg/instancegroups/rollingupdate.go:196 +0xa69
main.RunRollingUpdateCluster({0x7187d58?, 0xc00005c1e8}, 0x0?, {0x716c340, 0xc000010018}, 0xc0008085a0)
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:434 +0xf45
main.NewCmdRollingUpdateCluster.func1(0xc000177900?, {0xc000707380?, 0x2?, 0x2?})
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:176 +0x37
github.com/spf13/cobra.(*Command).execute(0xc000177900, {0xc000707360, 0x2, 0x2})
	github.com/spf13/[email protected]/command.go:872 +0x694
github.com/spf13/cobra.(*Command).ExecuteC(0x94984a0)
	github.com/spf13/[email protected]/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/[email protected]/command.go:918
main.Execute()
	k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
	k8s.io/kops/cmd/kops/main.go:20 +0x17

yurrriq, Sep 08 '22 12:09

Do you happen to have an ASG with warm pools enabled that is scaled to 0? In that case, I think https://github.com/kubernetes/kops/blob/9ed92a963ede06a0f6ca28ff7729605a6e9f9ad7/pkg/instancegroups/instancegroups.go#L155 would end up trying to access update[0 - 1 - 0] (see the sketch after this comment).

olemarkus, Sep 09 '22 07:09
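For illustration, here is a minimal Go sketch of the indexing pattern described above. The names are hypothetical and this is not the actual kops source; it only shows how an index computed as len(update) - numSurge - skipped evaluates to 0 - 1 - 0 = -1 when every instance in the group was a warm-pool instance (group scaled to 0) and has already been deleted, leaving the update list empty.

package main

import "fmt"

// detachSurgeInstances is a hypothetical stand-in for the surge-detach loop,
// not the real kops code. It shows how the computed index can go negative when
// the update list is empty but maxSurge is still at least 1.
func detachSurgeInstances(update []string, maxSurge int) error {
	skipped := 0
	for numSurge := 1; numSurge <= maxSurge; numSurge++ {
		idx := len(update) - numSurge - skipped // 0 - 1 - 0 == -1 for an empty list
		if idx < 0 {
			// Without this guard, update[idx] panics with "index out of range [-1]".
			return fmt.Errorf("no instances left to detach (index %d)", idx)
		}
		fmt.Println("detaching", update[idx])
	}
	return nil
}

func main() {
	// Group scaled to 0: every instance was a warm-pool instance and was
	// already deleted earlier in the rolling update, so nothing remains.
	var update []string
	if err := detachSurgeInstances(update, 1); err != nil {
		fmt.Println("error:", err)
	}
}

Guarding against an empty update list (or skipping the surge/detach step entirely when the group is scaled to 0) would avoid the panic.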

Yep. This will cause a panic. Reproduced!

olemarkus, Sep 09 '22 07:09