azure-aci
SIGSEGV errors after upgrading to v1.4.6
Describe the Issue
At ~6pm on 03/11/22 our AKS virtual-node addon image was updated to run v1.4.6. From that point onwards the aci-connector-linux-* pod inside the aci-connector-linux deployment keeps crashing with a stack trace like the one below:
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x12e7971]
goroutine 177 [running]:
github.com/virtual-kubelet/azure-aci/pkg/provider.podStatusFromContainerGroup(0xc000e8b6d0)
	/workspace/pkg/provider/aci.go:1810 +0xb11
github.com/virtual-kubelet/azure-aci/pkg/provider.(*ACIProvider).GetPodStatus(0xc000192a80, {0x1ad62e8?, 0xc000a55fb0?}, {0xc001004f69, 0x7}, {0xc000b70960, 0x1e})
	/workspace/pkg/provider/aci.go:910 +0x2bc
github.com/virtual-kubelet/azure-aci/pkg/provider.(*ACIProvider).FetchPodStatus(0x1ad62e8?, {0x1ad62e8?, 0xc000a55fb0?}, {0xc001004f69?, 0x1?}, {0xc000b70960?, 0xb78f7b?})
	/workspace/pkg/provider/aci.go:985 +0x2d
github.com/virtual-kubelet/azure-aci/pkg/provider.(*PodsTracker).processPodUpdates(0xc000148500, {0x1ad62e8?, 0xc0008ce3c0?}, 0xc0006cb400)
	/workspace/pkg/provider/podsTracker.go:138 +0x196
github.com/virtual-kubelet/azure-aci/pkg/provider.(*PodsTracker).updatePodsLoop(0xc000148500, {0x1ad62e8?, 0xc0006779b0?})
	/workspace/pkg/provider/podsTracker.go:95 +0x13b
github.com/virtual-kubelet/azure-aci/pkg/provider.(*PodsTracker).StartTracking(0xc000061fd0?, {0x1ad62e8?, 0xc000990750?})
	/workspace/pkg/provider/podsTracker.go:60 +0x24b
created by github.com/virtual-kubelet/azure-aci/pkg/provider.NotifyPods
	/workspace/pkg/provider/aci.go:961 +0x18d
```
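For illustration only, here is a minimal sketch of the failure mode the trace suggests: `podStatusFromContainerGroup` dereferencing a pointer field that can be nil in the container group returned by ACI. The struct names and field layout below are hypothetical simplifications, not the real azure-aci types, and the guarded variant is just one possible defensive pattern, not the project's actual fix.

```go
package main

import "fmt"

// Hypothetical, simplified shapes of an ACI container group response --
// just enough structure to show the nil-dereference failure mode.
type InstanceView struct{ State string }

type ContainerProperties struct {
	// May be nil while ACI is still provisioning the container.
	InstanceView *InstanceView
}

type Container struct {
	Properties *ContainerProperties
}

type ContainerGroup struct {
	Containers []Container
}

// unguarded mirrors the kind of access that can SIGSEGV: it assumes
// every pointer along the chain is populated.
func unguarded(cg *ContainerGroup) string {
	return cg.Containers[0].Properties.InstanceView.State
}

// guarded checks each pointer before dereferencing and maps missing
// data to a placeholder status instead of panicking.
func guarded(cg *ContainerGroup) string {
	if len(cg.Containers) == 0 {
		return "Unknown"
	}
	p := cg.Containers[0].Properties
	if p == nil || p.InstanceView == nil {
		return "Pending"
	}
	return p.InstanceView.State
}

func main() {
	// A container whose InstanceView has not been populated yet.
	cg := &ContainerGroup{Containers: []Container{{Properties: &ContainerProperties{}}}}

	fmt.Println(guarded(cg)) // prints "Pending"

	defer func() {
		if r := recover(); r != nil {
			// unguarded access panics with the same runtime error seen
			// in the stack trace above.
			fmt.Println("recovered:", r)
		}
	}()
	fmt.Println(unguarded(cg))
}
```

Running this prints "Pending" and then recovers from the same "invalid memory address or nil pointer dereference" runtime error shown in the trace.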
Steps To Reproduce
In our cluster, just having the virtual-node addon enabled was enough to cause the pod to crash. We noticed that if we already had pods active on the virtual node it would crash sooner. (Deleting those pods and then restarting the aci-connector-linux pod let it run for longer before crashing.)
Expected behavior: v1.4.6 behaves exactly as v1.4.5 did
Virtual-kubelet version: v1.4.6
azure-aci plugin version: How do I tell this?
Kubernetes version: 1.23.8
Additional context
We're running an AKS cluster set up in private mode with KEDA and the virtual-node addon.
Rebooting the cluster and nodes didn't work.
Disabling and re-enabling the addon also didn't fix it.
We've now followed the steps in the downgrade docs (https://github.com/virtual-kubelet/azure-aci/blob/master/docs/DOWNGRADE-README.md) to get v1.4.5 running alongside v1.4.6, but v1.4.6 is still crashing. (Thanks for the tip about using labels so we don't keep trying to put pods on v1.4.6.)
I've left the addon running and I'm happy to provide logs if needed. (You may just have to tell me which logs/commands you'd like me to run; I'm not massively knowledgeable about k8s.)