Rolling update panic with warm pools
/kind bug
1. What kops version are you running? The command kops version will display this information.
Version 1.23.2 (git-1.23.2)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
kubernetesVersion: 1.22.10
3. What cloud provider are you using?
cloudProvider: aws
4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --yes. We were also upgrading from kOps 1.22.4 to 1.23.2, in case that's relevant.
5. What happened after the commands executed?
Panic.
6. What did you expect to happen?
Smooth rolling update.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
# ...
spec:
  # ...
  cloudProvider: aws
  kubernetesVersion: 1.22.10 # was 1.22.7, in case that's relevant
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
# ...
spec:
  # ...
  maxSize: 30
  minSize: 1
  role: Node
  rollingUpdate:
    drainAndTerminate: true
    maxSurge: 60%
  warmPool:
    enableLifecycleHook: true
    maxSize: 4
    minSize: 3
8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
I0609 14:19:12.802210 485 instancegroups.go:309] Tainting 2 nodes in "nodes-eu-central-1a" instancegroup.
I0609 14:19:12.837133 485 instancegroups.go:131] deleting warm pool instance "AAAA"
I0609 14:19:13.004328 485 instancegroups.go:131] deleting warm pool instance "BBBB"
I0609 14:19:13.161806 485 instancegroups.go:131] deleting warm pool instance "CCCC"
I0609 14:19:13.309859 485 instancegroups.go:565] Detaching instance "DDDD", ...
I0609 14:19:13.564841 485 instancegroups.go:170] waiting for 15s after detaching instance
I0609 14:19:28.565202 485 instancegroups.go:467] Validating the cluster.
I0609 14:19:30.949595 485 instancegroups.go:503] Cluster validated; revalidating in 10s to make sure it does not flap.
I0609 14:19:44.024648 485 instancegroups.go:500] Cluster validated.
I0609 14:19:44.024687 485 instancegroups.go:565] Detaching instance "EEEE", ...
panic: runtime error: index out of range [-1]
goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc0005ba900, 0xc0007de000, 0x37e11d600)
k8s.io/kops/pkg/instancegroups/instancegroups.go:154 +0x1031
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc0005ba900, 0xc0007d5590, 0xc0008aeb40)
k8s.io/kops/pkg/instancegroups/rollingupdate.go:196 +0xa45
main.RunRollingUpdateCluster({0x7439c80, 0xc00005c1c8}, 0xc0008bfd98, {0x7343860, 0xc000010018}, 0xc00071a510)
k8s.io/kops/cmd/kops/rolling-update_cluster.go:434 +0xf45
main.NewCmdRollingUpdateCluster.func1(0xc000765900, {0xc000088800, 0x2, 0x2})
k8s.io/kops/cmd/kops/rolling-update_cluster.go:176 +0x37
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).execute(0xc000765900, {0xc0000887e0, 0x2, 0x2})
k8s.io/kops/vendor/github.com/spf13/cobra/command.go:856 +0x60e
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x9ece900)
k8s.io/kops/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).Execute(...)
k8s.io/kops/vendor/github.com/spf13/cobra/command.go:902
main.Execute()
k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
k8s.io/kops/cmd/kops/main.go:20 +0x17
9. Anything else do we need to know?
Similar to #11774
We're still seeing this quite frequently.
Please try to reproduce this with kOps 1.24.1. It should give more information about why the detach fails, which may be what causes this.
Noted. We're currently still using 1.23.2.
Still seeing this with kOps 1.24.1. Here are some logs:
I0908 09:43:08.745335 6639 instancegroups.go:311] Tainting 6 nodes in "nodes-eu-central-1c" instancegroup.
I0908 09:43:08.848436 6639 instancegroups.go:132] deleting warm pool instance "AAAAA"
I0908 09:43:09.003241 6639 instancegroups.go:132] deleting warm pool instance "BBBBB"
I0908 09:43:09.141794 6639 instancegroups.go:132] deleting warm pool instance "CCCCC"
I0908 09:43:09.293505 6639 instancegroups.go:132] deleting warm pool instance "DDDDD"
I0908 09:43:09.463532 6639 instancegroups.go:132] deleting warm pool instance "EEEEE"
I0908 09:43:09.632396 6639 instancegroups.go:132] deleting warm pool instance "FFFFF"
I0908 09:43:09.791094 6639 instancegroups.go:132] deleting warm pool instance "GGGGG"
I0908 09:43:09.962070 6639 instancegroups.go:132] deleting warm pool instance "HHHHH"
I0908 09:43:10.099380 6639 instancegroups.go:132] deleting warm pool instance "IIIII"
I0908 09:43:10.241848 6639 instancegroups.go:132] deleting warm pool instance "JJJJJ"
I0908 09:43:10.385283 6639 instancegroups.go:568] Detaching instance "KKKKK", node "ip-KKKKK.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:10.739226 6639 instancegroups.go:172] waiting for 15s after detaching instance
I0908 09:43:25.739428 6639 instancegroups.go:470] Validating the cluster.
I0908 09:43:28.186852 6639 instancegroups.go:506] Cluster validated; revalidating in 10s to make sure it does not flap.
I0908 09:43:40.244820 6639 instancegroups.go:503] Cluster validated.
I0908 09:43:40.244869 6639 instancegroups.go:568] Detaching instance "LLLLL", node "ip-LLLLL.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:40.678965 6639 instancegroups.go:568] Detaching instance "MMMMM", node "ip-MMMMM.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.120884 6639 instancegroups.go:568] Detaching instance "NNNNN", node "ip-NNNNN.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.547119 6639 instancegroups.go:568] Detaching instance "OOOOO", node "ip-OOOOO.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.957413 6639 instancegroups.go:568] Detaching instance "PPPPP", node "ip-PPPPP.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
panic: runtime error: index out of range [-1]
goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc000452240, 0xc0002c01c0, 0x37e11d600)
k8s.io/kops/pkg/instancegroups/instancegroups.go:155 +0x1410
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc000452240, 0xc00087cbd0, 0xc000a53560?)
k8s.io/kops/pkg/instancegroups/rollingupdate.go:196 +0xa69
main.RunRollingUpdateCluster({0x7187d58?, 0xc00005c1e8}, 0x0?, {0x716c340, 0xc000010018}, 0xc0008085a0)
k8s.io/kops/cmd/kops/rolling-update_cluster.go:434 +0xf45
main.NewCmdRollingUpdateCluster.func1(0xc000177900?, {0xc000707380?, 0x2?, 0x2?})
k8s.io/kops/cmd/kops/rolling-update_cluster.go:176 +0x37
github.com/spf13/cobra.(*Command).execute(0xc000177900, {0xc000707360, 0x2, 0x2})
github.com/spf13/[email protected]/command.go:872 +0x694
github.com/spf13/cobra.(*Command).ExecuteC(0x94984a0)
github.com/spf13/[email protected]/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/[email protected]/command.go:918
main.Execute()
k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
k8s.io/kops/cmd/kops/main.go:20 +0x17
Do you happen to have an ASG with warm pools enabled that is being scaled to 0? In that case, I think https://github.com/kubernetes/kops/blob/9ed92a963ede06a0f6ca28ff7729605a6e9f9ad7/pkg/instancegroups/instancegroups.go#L155 would end up trying to access update[0 - 1 - 0].
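For illustration, here is a minimal Go sketch of that failure mode (the function and variable names are hypothetical, not the actual kOps code): once the warm-pool instances have been deleted and nothing is left in the slice of instances to update, an index computed as len(update) - 1 - 0 evaluates to -1 and the slice access panics.

package main

import "fmt"

// detachNext is a made-up stand-in for the rolling-update logic: it picks
// the next instance to detach from the tail of the update slice.
func detachNext(update []string, alreadyDetached int) string {
	idx := len(update) - 1 - alreadyDetached // 0 - 1 - 0 == -1 when update is empty
	return update[idx]                       // panics: index out of range [-1]
}

func main() {
	var update []string // ASG scaled to 0: no instances left to update
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r) // prints: panic: runtime error: index out of range [-1]
		}
	}()
	fmt.Println(detachNext(update, 0))
}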
Yep. This will cause a panic. Reproduced!