
fix: [NPM] cleanup restarted pod stuck with no IP


Overview

On AKS Windows Server 2022 nodes under memory pressure, pods may restart and enter a perpetual Error state, where the pod is stuck in Running status with no assigned IP.

Issue 1

In general, if Pod A is stuck in this Error state, we should clean up kernel state referencing its old IP. Fix: enqueue pod updates that have no IP, so the pod gets cleaned up.
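A minimal sketch of what this fix describes, assuming a typical client-go informer update handler; podController and its queue field are illustrative stand-ins, not NPM's actual types:

```go
package npm

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// podController is an illustrative stand-in for NPM's pod controller.
type podController struct {
	queue workqueue.RateLimitingInterface
}

// updatePod sketches the Issue 1 fix: enqueue Pod updates even when the
// Pod has lost its IP, so the reconciler can clean up kernel state that
// still references the old IP.
func (c *podController) updatePod(oldObj, newObj interface{}) {
	oldPod, ok1 := oldObj.(*corev1.Pod)
	newPod, ok2 := newObj.(*corev1.Pod)
	if !ok1 || !ok2 {
		return
	}
	// A restarted Pod stuck in the Error state reports Running with an
	// empty IP; previously such updates could be skipped entirely.
	if newPod.Status.PodIP == "" && oldPod.Status.PodIP != "" {
		if key, err := cache.MetaNamespaceKeyFunc(newPod); err == nil {
			c.queue.Add(key)
		}
	}
}
```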

Issue 2

This will be fixed in a separate PR. Scenario (this occurs ~10% of the time in memory-starved clusters):

  • Pod A originally has an IP and endpoint x.
  • Pod A restarts, NPM restarts, and Pod B starts with the same IP and endpoint y.
  • NPM processes a Pod Create event for Pod A and its old IP before processing the Pod Create event for Pod B, so NPM only sees endpoint y.
  • Endpoint y will incorrectly have Pod A's policies.

Fix: delete Pod A's policies from Pod B's endpoint, and eventually assign Pod B to the endpoint.
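A sketch of that cleanup under stated assumptions; the dataplane interface and both of its method names are hypothetical stand-ins for NPM's Windows dataplane, which actually talks to HNS:

```go
package npm

// dataplane is a hypothetical stand-in for NPM's Windows dataplane.
type dataplane interface {
	// PoliciesOnEndpoint lists the ACL policy IDs applied to an endpoint.
	PoliciesOnEndpoint(endpointID string) ([]string, error)
	// RemovePolicy detaches one policy from an endpoint.
	RemovePolicy(endpointID, policyID string) error
}

// reconcileStaleEndpoint handles the race above: Pod A's stale Create
// event was applied to Pod B's endpoint (same IP, endpoint y), so we
// strip Pod A's leftover policies before Pod B claims the endpoint.
func reconcileStaleEndpoint(dp dataplane, endpointID string, stalePolicyIDs map[string]bool) error {
	applied, err := dp.PoliciesOnEndpoint(endpointID)
	if err != nil {
		return err
	}
	for _, id := range applied {
		if stalePolicyIDs[id] {
			if err := dp.RemovePolicy(endpointID, id); err != nil {
				return err
			}
		}
	}
	return nil
}
```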

huntergregory avatar Aug 01 '22 21:08 huntergregory

The current design leads to a ~65% increase in controller workqueue updates. The design should change, since this may have a significant memory impact.

huntergregory avatar Aug 22 '22 19:08 huntergregory

> The current design leads to a ~65% increase in controller workqueue updates. The design should change, since this may have a significant memory impact.

When enqueuing empty-IP updates only for Pods in Running status, there are just 4 update-with-empty-ip events for 736 regular update events (in a Windows conformance run).
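A minimal sketch of that gating; the function name is illustrative, while corev1.PodRunning is the real Kubernetes API constant:

```go
package npm

import corev1 "k8s.io/api/core/v1"

// shouldEnqueueEmptyIPUpdate sketches the mitigation described above:
// only enqueue update-with-empty-ip events for Pods in Running phase,
// which kept the extra events to 4 out of 736 in the conformance run.
func shouldEnqueueEmptyIPUpdate(pod *corev1.Pod) bool {
	return pod.Status.PodIP == "" && pod.Status.Phase == corev1.PodRunning
}
```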

huntergregory avatar Aug 24 '22 18:08 huntergregory

This pull request is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.

github-actions[bot] avatar Dec 06 '22 00:12 github-actions[bot]

There is no performance impact for pure Linux clusters.

As discussed above, there is hardly any impact for clusters with Windows Server 2022:

> There are only 4 update-with-empty-ip events for 736 regular update events

Experiments

Steps:

  1. Create deployment with 1 replica.
  2. Scale to 2k replicas.
  3. After a while, delete all 2k Pods, which causes 2k replacement Pods to be created.

Experiment 1: Uptime SLA for API Server

  • Cluster: az aks create -g $rg -n $cluster --network-plugin azure --max-pods 250 -c 16 --uptime-sla
  • Pod Image: k8s.gcr.io/pause:3.2
    • This Pod has essentially no memory/CPU overhead.

Results

There were 2 update-with-empty-ip events.

npm_controller_pod_event_total{operation="create"} 9
npm_controller_pod_event_total{operation="update"} 16025
npm_controller_pod_event_total{operation="update-with-empty-ip"} 2
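For context, a counter like npm_controller_pod_event_total could be defined with the Prometheus client_golang library roughly as below; this is a hedged sketch, not NPM's actual metrics code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// podEventTotal sketches how a counter such as
// npm_controller_pod_event_total might be defined.
var podEventTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "npm_controller_pod_event_total",
		Help: "Pod events handled by the NPM controller, by operation.",
	},
	[]string{"operation"},
)

func init() {
	prometheus.MustRegister(podEventTotal)
}

// recordPodEvent increments the counter for an operation such as
// "create", "update", or "update-with-empty-ip".
func recordPodEvent(operation string) {
	podEventTotal.WithLabelValues(operation).Inc()
}
```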

Experiment 2: No Uptime SLA and Heavier Pod Image

  • Cluster downgraded: az aks update -g $rg -n $cluster --no-uptime-sla
  • Pod Image: k8s.gcr.io/e2e-test-images/agnhost:2.33
    • Command: /agnhost serve-hostname --tcp --http=false --port "80"

Results

There were no update-with-empty-ip events.

npm_controller_pod_event_total{operation="create"} 10
npm_controller_pod_event_total{operation="update"} 21391

huntergregory avatar Dec 16 '22 22:12 huntergregory

/azp run

huntergregory avatar Jan 03 '23 20:01 huntergregory

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jan 03 '23 20:01 azure-pipelines[bot]

Current test runs are successful, excluding:

  1. a flake in conformance stress (verified unrelated to the change)
  2. HNS-related errors in Windows cyclonus and conformance runs

These were verified as unrelated to the change by searching for "warning: ADD POD", a log line that would be hit in the control flow related to this change.

huntergregory avatar Feb 14 '23 18:02 huntergregory