azure-container-networking
fix: [NPM] cleanup restarted pod stuck with no IP
Overview
On AKS Windows Server '22 nodes under memory pressure, pods may restart and become stuck in a perpetual error state: the pod reports Running status but has no assigned IP.
Issue 1
In general, if pod A is stuck in this error state, NPM should clean up kernel state referencing its old IP. Fix: enqueue pod updates that have no IP so that the pod's stale state gets cleaned up.
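A minimal sketch of the enqueue decision described above, using hypothetical names (`shouldEnqueue` and its parameters are illustrative, not the actual NPM controller code):

```go
package main

import "fmt"

// shouldEnqueue sketches the update-handler decision: previously, pod updates
// with an empty IP were dropped; with this fix they are enqueued so kernel
// state keyed on the pod's old IP can be cleaned up. To keep extra workqueue
// traffic small, empty-IP updates are queued only for pods in Running status.
func shouldEnqueue(phase, podIP string) bool {
	if podIP == "" {
		// Only Running pods with no IP take the cleanup path.
		return phase == "Running"
	}
	// Regular update with an assigned IP: always enqueue.
	return true
}

func main() {
	fmt.Println(shouldEnqueue("Running", ""))         // cleanup path
	fmt.Println(shouldEnqueue("Pending", ""))         // skipped
	fmt.Println(shouldEnqueue("Running", "10.0.0.5")) // normal update
}
```

The Running-only condition is what keeps the update-with-empty-ip event count low relative to regular updates, as measured in the experiments below.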
Issue 2
This will be fixed in a follow-up PR. Scenario (observed ~10% of the time in memory-starved clusters):
- Pod A originally has an IP and endpoint x.
- Pod A restarts, NPM restarts, and Pod B restarts with same IP and endpoint y.
- NPM processes a Pod Create event for Pod A and its old IP before processing a Pod Create event for Pod B. NPM only sees endpoint y.
- Endpoint y incorrectly carries pod A's policies.
Fix: delete pod A's policies from pod B's endpoint, then eventually assign pod B to the endpoint.
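The fix for this scenario can be sketched as follows. This is a simplified illustration, assuming the endpoint's applied policies can be modeled as a set of policy IDs (`endpoint`, `reassign`, and the policy names are hypothetical, not NPM's actual types):

```go
package main

import "fmt"

// endpoint models, hypothetically, the per-endpoint state NPM tracks:
// the set of ACL policy IDs currently applied to the HNS endpoint.
type endpoint struct {
	id       string
	policies map[string]bool
}

// reassign sketches the fix: first remove the stale pod's (pod A's) policies
// from the endpoint, then apply the new pod's (pod B's) policies so that
// pod B cleanly owns the endpoint.
func reassign(ep *endpoint, stalePolicies, newPolicies []string) {
	for _, p := range stalePolicies {
		delete(ep.policies, p) // drop pod A's leftover ACLs
	}
	for _, p := range newPolicies {
		ep.policies[p] = true // endpoint y now carries pod B's policies
	}
}

func main() {
	// Endpoint y starts out incorrectly holding pod A's policies.
	ep := &endpoint{
		id:       "y",
		policies: map[string]bool{"podA-acl-1": true, "podA-acl-2": true},
	}
	reassign(ep, []string{"podA-acl-1", "podA-acl-2"}, []string{"podB-acl-1"})
	fmt.Println(len(ep.policies)) // only pod B's policy remains
}
```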
The current design leads to a ~65% increase in controller workqueue updates. The design should change, since this may have a significant memory impact.
When queuing for Running status only, there are only 4 update-with-empty-ip events per 736 regular update events (in a Windows conformance run).
There is no performance impact for pure Linux clusters.
As discussed above, there is hardly any impact for clusters with Windows Server '22: only 4 update-with-empty-ip events per 736 regular update events.
Experiments
Steps:
- Create deployment with 1 replica.
- Scale to 2k replicas.
- After a while, delete all 2k Pods. This causes 2k new Pods to be created too.
Experiment 1: Uptime SLA for API Server
- Cluster:
az aks create -g $rg -n $cluster --network-plugin azure --max-pods 250 -c 16 --uptime-sla
- Pod Image:
k8s.gcr.io/pause:3.2
This Pod doesn't have any memory/CPU overhead.
Results
There were 2 update-with-empty-ip events.
npm_controller_pod_event_total{operation="create"} 9
npm_controller_pod_event_total{operation="update"} 16025
npm_controller_pod_event_total{operation="update-with-empty-ip"} 2
Experiment 2: No Uptime SLA and Heavier Pod Image
- Cluster downgraded:
az aks update -g $rg -n $cluster --no-uptime-sla
- Pod Image:
k8s.gcr.io/e2e-test-images/agnhost:2.33
- Command:
/agnhost serve-hostname --tcp --http=false --port "80"
Results
There were no update-with-empty-ip events.
npm_controller_pod_event_total{operation="create"} 10
npm_controller_pod_event_total{operation="update"} 21391
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
Current test runs are successful, excluding:
- a flake in conformance stress (verified unrelated to this change)
- HNS-related errors in Windows cyclonus and conformance runs; verified unrelated to this change by searching for "warning: ADD POD", a log line that would be hit in the control flow related to this change