PolicyEndpoint CRD deletion on addon upgrade can lead to indefinite network interruption in rare circumstances
When the VPC CNI addon is upgraded to 1.20 (unsure about earlier versions), it deletes the PolicyEndpoint CRD and does not recreate it. In #180, a controller to automatically recreate this CRD was added (cc @Issacwww), and it cleverly exits the controller process to trigger a full resync. However, there are two controller replicas, both running this logic, and the non-leader may be the one that first notices the CRD is missing: it can exit, restart immediately, and recreate the CRD before the leader notices (it has up to 15s in which to do so, but I have observed it happen in 1s). Note that due to (I presume) backoff when restarting the controller, the more times you do an upgrade, the longer it takes for the CRD to be recreated, and once that takes longer than 15s the leader is guaranteed to notice that the CRD is gone. So this is a bit of a heisenbug; after a few attempts to reproduce, it becomes impossible.
Without the resync on the leader, it carries on as normal: PolicyEndpoints are created as the underlying NetworkPolicies get reconciled, but that means only a subset of PolicyEndpoints will exist. If I have an 'allow all egress' policy in addition to an 'allow egress to kube-dns' policy, the kube-dns PolicyEndpoint will be recreated when a coredns pod is deleted, but the allow-all-egress PolicyEndpoint might not be recreated for an arbitrarily long time. During that window, egress is isolated except to coredns.
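For context, the race is easiest to see against the shape of that #180 logic. The sketch below is not the actual controller code, just a minimal illustration of the pattern as described above (the 15s check interval, the function name, and exiting the process to force recreation plus a full resync are assumptions for illustration); the point is that it runs unconditionally in every replica, which is what lets the non-leader win the recreate race:

```go
package main

import (
	"context"
	"os"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// watchCRD polls for the PolicyEndpoint CRD and exits the process when it is
// gone, so the replacement pod reinstalls the CRD and resyncs everything.
// Because every replica runs this, leader or not, a non-leader can exit,
// restart, and recreate the CRD before the leader's next check ever sees it
// missing, and then the leader never resyncs its PolicyEndpoints.
func watchCRD(ctx context.Context, c client.Client) {
	ticker := time.NewTicker(15 * time.Second) // assumed check interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			var crd apiextensionsv1.CustomResourceDefinition
			err := c.Get(ctx, client.ObjectKey{Name: "policyendpoints.networking.k8s.aws"}, &crd)
			if apierrors.IsNotFound(err) {
				os.Exit(1) // restart to reinstall the CRD and force a full resync
			}
		}
	}
}

func main() {
	scheme := runtime.NewScheme()
	_ = apiextensionsv1.AddToScheme(scheme)
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}
	watchCRD(context.Background(), c)
}
```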
Repro:
kubectl create namespace test-crd-delete
# create your netpols
kubectl -n test-crd-delete apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-egress
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
EOF
kubectl -n test-crd-delete apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-one-target
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: doesnotexist
# create a test pod
kubectl -n test-crd-delete run netshoot --image=nicolaka/netshoot -it --restart=Never -- bash
# validate your policyendpoints exist
kubectl -n test-crd-delete describe policyendpoints.networking.k8s.aws
# validate outbound connectivity in your netshoot pod
curl google.com
# save the crd so we can recreate it ourselves after deleting it
kubectl get crd policyendpoints.networking.k8s.aws -o yaml > crd.yaml
# delete the policyendpoints crd (as would happen on addon upgrade) and simulate the non-leader's fast recreation by creating it ourselves
kubectl delete crd policyendpoints.networking.k8s.aws && kubectl create -f crd.yaml
# observe that you have no policyendpoints
kubectl -n test-crd-delete describe policyendpoints.networking.k8s.aws
# trigger the allow-one-target netpol to be reconciled by updating it
kubectl -n test-crd-delete patch netpol allow-one-target --type='json' -p='[{"op": "replace", "path": "/spec/egress/0/to/0/podSelector/matchLabels/app", "value": "doesnotexist2"}]'
# observe that you now have only one policyendpoint
kubectl -n test-crd-delete describe policyendpoints.networking.k8s.aws
# validate that you have no outbound connectivity in your netshoot pod
curl google.com
Proposed fix: only the leader should monitor CRD existence. I would be keen to contribute this if there is consensus.
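A minimal sketch of what that could look like, assuming the CRD monitor is (or could become) a controller-runtime Runnable added to the manager; the type and option names below are illustrative, not the actual controller code. Runnables that implement NeedLeaderElection() and return true are only started on the replica that wins leader election, so a non-leader could never recreate the CRD behind the leader's back:

```go
package main

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// crdMonitor is the hypothetical CRD-existence watcher from the sketch above.
type crdMonitor struct {
	client client.Client
}

// Start runs until the context is cancelled; the body would poll for the
// PolicyEndpoint CRD via m.client and recreate/resync when it is missing.
func (m *crdMonitor) Start(ctx context.Context) error {
	ticker := time.NewTicker(15 * time.Second) // assumed check interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			// ... check the CRD and trigger recreation + resync here ...
		}
	}
}

// NeedLeaderElection tells the manager to start this runnable only on the
// elected leader, so a non-leader replica can never win the recreate race.
func (m *crdMonitor) NeedLeaderElection() bool { return true }

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:   true,
		LeaderElectionID: "example-controller-leader", // illustrative ID
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Add(&crdMonitor{client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

With that gating, the exit-and-resync approach from #180 keeps working, but only the replica that owns reconciliation can ever observe the deletion and react to it.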
@jackkleeman We just removed the PolicyEndpoint CRD so that it does not get installed by the addon going forward. But this is an interesting issue that you've pointed out with the reconciler. We will investigate this condition.
Maybe that is why I found that upgrading my addon deleted the CRD? This behaviour was very surprising to me, as deleting the CRD deletes all PolicyEndpoints, which is certainly not something I would choose to do to a live cluster! Perhaps it is only 1.20 addon upgrades that are the problem? We encountered a 30-minute networking outage in prod after doing that 1.20 upgrade yesterday. The fix was to trigger the controller to recreate the PolicyEndpoints, e.g. by labelling or deleting pods.
> deleting the CRD deletes all PolicyEndpoints, which is certainly not something I would choose to do to a live cluster!
We are looking at it. Agreed this shouldn't be the behavior
To be clear, we are not trying to delete the definition from clusters. The intention was to make a clean path for CRD management. We do have other services that guarantee the definition is installed. We have found the root cause and are working on a fix.