
[BUG] ALB Controller informer cache fails to sync ApplicationLoadBalancer CR after leader election

Open maxesse opened this issue 1 month ago • 3 comments

Summary

This issue happens on both the 1.8.12 and 1.9.6 versions of the Helm chart. For me it recurs almost always about 7 days after the last 'fix' (a manual restart of the controller pods). After acquiring the leader lease, the ALB controller's informer cache fails to populate the ApplicationLoadBalancer CR, causing all Gateway/HTTPRoute reconciliations to fail with "ApplicationLoadBalancer CRD not found" errors. This state persists indefinitely until the controller pods are manually restarted.

Environment

  • AKS version: 1.33.5
  • ALB Controller version: 1.9.6
  • Helm chart: mcr.microsoft.com/application-lb/charts/alb-controller:1.9.6
  • Node provisioning: Azure Node Auto Provisioning (NAP) with Karpenter

Steps to Reproduce

  1. Deploy ALB controller on Karpenter-managed nodes (ephemeral VMs)
  2. Wait for Karpenter to bin-pack or terminate the node running the controller leader
  3. New controller pod starts and attempts to acquire leader lease
  4. After acquiring lease, controller enters broken state

Expected Behavior

After acquiring the leader lease, the controller should (see the sketch after this list):

  1. Wait for all informer caches to sync
  2. Begin reconciliation only after caches are populated
  3. Successfully find the ApplicationLoadBalancer CR when processing Gateways
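For reference, here is a minimal sketch of that ordering, assuming the controller is built on controller-runtime. Only the lease name and namespace are taken from the logs below; everything else is illustrative, not the ALB controller's actual code.

```go
// Minimal sketch, assuming a controller-runtime based controller; names and
// error handling are illustrative, not the ALB controller's actual source.
package main

import (
	"context"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// Lease ID/namespace taken from the log lines in this report.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "alb-controller-leader-election",
		LeaderElectionNamespace: "azure-alb-system",
	})
	if err != nil {
		panic(err)
	}

	// Runnables that require leader election start only after the lease is
	// won; this explicit guard additionally refuses to do any work until the
	// shared informer caches backing mgr.GetClient() report synced, so a
	// lookup of the ApplicationLoadBalancer CR cannot race a cold cache.
	if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		if !mgr.GetCache().WaitForCacheSync(ctx) {
			return fmt.Errorf("informer caches failed to sync")
		}
		// Controllers built with ctrl.NewControllerManagedBy(mgr) perform the
		// same WaitForCacheSync before starting their workers.
		return nil
	})); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```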

Actual Behavior

The controller acquires the lease but immediately starts reconciling before informer caches are synced:

{"level":"info","Timestamp":"2025-12-02T18:23:07.357Z","message":"successfully acquired lease azure-alb-system/alb-controller-leader-election"}
{"level":"info","Timestamp":"2025-12-02T18:23:07.568Z","message":"Reconciling","object":{"name":"collab-aks-alb","namespace":"azure-alb-external"}}
{"level":"info","Timestamp":"2025-12-02T18:23:07.568Z","message":"Reconcile successful"}
{"level":"info","Timestamp":"2025-12-02T18:23:07.571Z","message":"ApplicationLoadBalancer CRD azure-alb-external/collab-aks-alb not found for Gateway azure-alb-external/entryms-gateway"}

Note: The ALB reconciliation itself succeeds (third log line above), but immediately after, when processing a Gateway, the controller cannot find the same ALB in its cache (fourth log line).

This "not found" error repeats for every resource reconciliation for hours:

2025-12-02T18:23:07 - First "not found" error
2025-12-03T08:11:52 - Still showing "not found" (14+ hours later)

The only recovery is deleting the gateway resource and manually restarting the controller pods.

Additional Observations

1. Leader Lease Starvation (24+ hours)

{"Timestamp":"2025-12-01T17:42:58Z","message":"attempting to acquire leader lease..."}
{"Timestamp":"2025-12-02T18:23:07Z","message":"successfully acquired lease..."} // 24+ hours later!

When the previous leader's node is terminated by Karpenter, the lease is not released cleanly. The new pod waits the full lease expiry duration.
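For context, this is the lease handling I would expect from a client-go style leader election setup; the sketch below is illustrative, not the ALB controller's actual code (only the lease name and namespace come from the logs). With ReleaseOnCancel the lease is released on a clean shutdown, but when the leader pod is killed abruptly, e.g. Karpenter deleting the node, that path never runs and the next candidate has to wait for the lease to expire.

```go
// Minimal client-go leader-election sketch; only the lease name/namespace
// come from the logs above, the rest is illustrative.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "alb-controller-leader-election",
			Namespace: "azure-alb-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // releases the lease on graceful shutdown only
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start reconcilers here, after informer caches have synced
			},
			OnStoppedLeading: func() {
				// lost or gave up the lease: exit so another replica takes over
				os.Exit(0)
			},
		},
	})
}
```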

2. Secondary Issue: VM Provider ID Format with Karpenter

The controller logs errors about Karpenter-provisioned nodes having invalid provider IDs:

{"level":"error","message":"Invalid or malformed Azure provider ID 'azure:///subscriptions/.../virtualMachines/aks-spot-lbctf'. Please ensure the node has a valid Azure provider ID in the format: azure:///subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}/virtualMachines/{instanceId}"}

Karpenter/NAP creates standalone VMs (virtualMachines/), not VMSS instances (virtualMachineScaleSets/.../virtualMachines/).
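To make the two shapes concrete, here is a small hypothetical Go snippet (not the controller's actual parser) that distinguishes the VMSS instance format the controller expects from the standalone VM format that Karpenter/NAP nodes carry:

```go
// Hypothetical sketch illustrating the two provider ID shapes involved:
// VMSS instances versus the standalone VMs that Karpenter/NAP provisions.
package main

import (
	"fmt"
	"regexp"
)

var (
	vmssInstanceRE = regexp.MustCompile(
		`^azure:///subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\.Compute/virtualMachineScaleSets/[^/]+/virtualMachines/[^/]+$`)
	standaloneVMRE = regexp.MustCompile(
		`^azure:///subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\.Compute/virtualMachines/[^/]+$`)
)

func classifyProviderID(id string) string {
	switch {
	case vmssInstanceRE.MatchString(id):
		return "vmss-instance" // format the controller currently expects
	case standaloneVMRE.MatchString(id):
		return "standalone-vm" // format Karpenter/NAP nodes carry
	default:
		return "unknown"
	}
}

func main() {
	// Placeholder IDs for illustration only.
	fmt.Println(classifyProviderID("azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vmss/virtualMachines/0"))
	fmt.Println(classifyProviderID("azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/aks-spot-abc"))
}
```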

For the time being I have moved alb-controller to a system node pool node that is not managed by Karpenter; those nodes won't churn, which mitigates the issue. Something I should have done a while ago anyway.

Logs

Full controller logs from incident available upon request.

maxesse · Dec 03 '25 14:12