SIGSEGV errors after upgrading to v1.4.6

Open ttq-ak opened this issue 2 years ago • 0 comments

Describe the Issue At ~6pm on 03/11/22 our AKS virtual-node addon image was updated to run v1.4.6. From that point onwards the aci-connector-linux-* pod in the aci-connector-linux deployment keeps crashing with a stack trace like the one below:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x12e7971]

goroutine 177 [running]:
github.com/virtual-kubelet/azure-aci/pkg/provider.podStatusFromContainerGroup(0xc000e8b6d0)
        /workspace/pkg/provider/aci.go:1810 +0xb11
github.com/virtual-kubelet/azure-aci/pkg/provider.(*ACIProvider).GetPodStatus(0xc000192a80, {0x1ad62e8?, 0xc000a55fb0?}, {0xc001004f69, 0x7}, {0xc000b70960, 0x1e})
        /workspace/pkg/provider/aci.go:910 +0x2bc
github.com/virtual-kubelet/azure-aci/pkg/provider.(*ACIProvider).FetchPodStatus(0x1ad62e8?, {0x1ad62e8?, 0xc000a55fb0?}, {0xc001004f69?, 0x1?}, {0xc000b70960?, 0xb78f7b?})
        /workspace/pkg/provider/aci.go:985 +0x2d
github.com/virtual-kubelet/azure-aci/pkg/provider.(*PodsTracker).processPodUpdates(0xc000148500, {0x1ad62e8?, 0xc0008ce3c0?}, 0xc0006cb400)
        /workspace/pkg/provider/podsTracker.go:138 +0x196
github.com/virtual-kubelet/azure-aci/pkg/provider.(*PodsTracker).updatePodsLoop(0xc000148500, {0x1ad62e8?, 0xc0006779b0?})
        /workspace/pkg/provider/podsTracker.go:95 +0x13b
github.com/virtual-kubelet/azure-aci/pkg/provider.(*PodsTracker).StartTracking(0xc000061fd0?, {0x1ad62e8?, 0xc000990750?})
        /workspace/pkg/provider/podsTracker.go:60 +0x24b
created by github.com/virtual-kubelet/azure-aci/pkg/provider.(*ACIProvider).NotifyPods
        /workspace/pkg/provider/aci.go:961 +0x18d
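
For anyone reading the trace: the panic is raised inside podStatusFromContainerGroup, and `addr=0x0` is what a nil pointer dereference looks like in Go. The code at aci.go:1810 is not shown in this issue, so the following is only a minimal sketch of this class of bug, assuming an optional field of the container group is read without a nil check. All types and function names here (`ContainerGroup`, `Container`, `InstanceView`, `unsafeStates`, `safeStates`, `panics`) are hypothetical stand-ins, not the real azure-aci or SDK definitions.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the container group types.
// The real structs used by azure-aci have different fields.
type InstanceView struct {
	State string
}

type Container struct {
	Name         string
	InstanceView *InstanceView // assumed optional: may be nil in a response
}

type ContainerGroup struct {
	Containers []Container
}

// unsafeStates reads InstanceView without a nil check, so it panics with
// "invalid memory address or nil pointer dereference" when the field is nil.
func unsafeStates(cg *ContainerGroup) []string {
	states := make([]string, 0, len(cg.Containers))
	for _, c := range cg.Containers {
		states = append(states, c.InstanceView.State)
	}
	return states
}

// safeStates checks each pointer first and returns an error instead of
// panicking, so a caller can log the problem and retry later.
func safeStates(cg *ContainerGroup) ([]string, error) {
	if cg == nil {
		return nil, fmt.Errorf("container group is nil")
	}
	states := make([]string, 0, len(cg.Containers))
	for _, c := range cg.Containers {
		if c.InstanceView == nil {
			return nil, fmt.Errorf("container %q has no instance view", c.Name)
		}
		states = append(states, c.InstanceView.State)
	}
	return states, nil
}

// panics reports whether calling f results in a panic.
func panics(f func()) (p bool) {
	defer func() {
		if r := recover(); r != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	// A container whose InstanceView was never populated.
	cg := &ContainerGroup{Containers: []Container{{Name: "app"}}}

	fmt.Println("unsafe version panics:", panics(func() { unsafeStates(cg) }))

	_, err := safeStates(cg)
	fmt.Println("safe version error:", err)
}
```

In the trace above the panic happens in a goroutine created by NotifyPods, and an unrecovered panic in any goroutine terminates the whole Go process. That would be consistent with the pod restarting repeatedly rather than a single status lookup failing.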

Steps To Reproduce In our cluster, just having the virtual-node addon enabled was enough to cause the pod to crash. We noticed that it crashed sooner if pods were already active on the virtual node. Deleting those pods and then restarting the aci-connector-linux pod let it run for longer before it crashed again.

Expected behavior v1.4.6 behaves exactly like v1.4.5 did

Virtual-kubelet version v1.4.6

azure-aci plugin version How do I tell this?

Kubernetes version 1.23.8

Additional context We're running an AKS cluster set up in private mode with KEDA and the virtual-node addon. Rebooting the cluster and nodes didn't help. Disabling and re-enabling the addon also didn't fix it.

We've now followed the steps in the downgrade docs (https://github.com/virtual-kubelet/azure-aci/blob/master/docs/DOWNGRADE-README.md) to get v1.4.5 running alongside v1.4.6, but v1.4.6 is still crashing. (Thanks for the tip about using labels so we don't keep trying to put pods on v1.4.6.)

I've left the addon running and I'm happy to provide logs if needed. You might have to tell me which logs or commands you'd like me to run, as I'm not massively knowledgeable about k8s.

ttq-ak, Nov 04 '22 16:11