
fix: [NPM] cleanup restarted pod stuck with no IP


Overview

On AKS Windows Server 2022 nodes under memory pressure, pods may restart and enter a perpetual Error state, where the pod is stuck in Running status with no assigned IP.

Issue 1

In general, if Pod A is stuck in this Error state, we should clean up kernel state referencing its old IP. Fix: enqueue pod updates that have no IP, so the pod gets cleaned up.
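A minimal sketch of what this fix describes, assuming a typical client-go informer update handler; podController and its queue field are illustrative stand-ins, not NPM's actual types:

```go
package npm

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// podController is an illustrative stand-in for NPM's pod controller.
type podController struct {
	queue workqueue.RateLimitingInterface
}

// updatePod sketches the Issue 1 fix: enqueue Pod updates even when the
// Pod has lost its IP, so the reconciler can clean up kernel state that
// still references the old IP.
func (c *podController) updatePod(oldObj, newObj interface{}) {
	oldPod, ok1 := oldObj.(*corev1.Pod)
	newPod, ok2 := newObj.(*corev1.Pod)
	if !ok1 || !ok2 {
		return
	}
	// A restarted Pod stuck in the Error state reports Running with an
	// empty IP; previously such updates could be skipped entirely.
	if newPod.Status.PodIP == "" && oldPod.Status.PodIP != "" {
		if key, err := cache.MetaNamespaceKeyFunc(newPod); err == nil {
			c.queue.Add(key)
		}
	}
}
```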

Issue 2

This will be fixed in a separate PR. Scenario (this occurs ~10% of the time in memory-starved clusters):

  • Pod A originally has an IP and endpoint x.
  • Pod A restarts, NPM restarts, and Pod B starts with the same IP and endpoint y.
  • NPM processes a Pod Create event for Pod A and its old IP before processing the Pod Create event for Pod B, so NPM only sees endpoint y.
  • Endpoint y will incorrectly have Pod A's policies.

Fix: delete Pod A's policies from Pod B's endpoint, and eventually assign Pod B to the endpoint.
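A sketch of that cleanup under stated assumptions; the dataplane interface and both of its method names are hypothetical stand-ins for NPM's Windows dataplane, which actually talks to HNS:

```go
package npm

// dataplane is a hypothetical stand-in for NPM's Windows dataplane.
type dataplane interface {
	// PoliciesOnEndpoint lists the ACL policy IDs applied to an endpoint.
	PoliciesOnEndpoint(endpointID string) ([]string, error)
	// RemovePolicy detaches one policy from an endpoint.
	RemovePolicy(endpointID, policyID string) error
}

// reconcileStaleEndpoint handles the race above: Pod A's stale Create
// event was applied to Pod B's endpoint (same IP, endpoint y), so we
// strip Pod A's leftover policies before Pod B claims the endpoint.
func reconcileStaleEndpoint(dp dataplane, endpointID string, stalePolicyIDs map[string]bool) error {
	applied, err := dp.PoliciesOnEndpoint(endpointID)
	if err != nil {
		return err
	}
	for _, id := range applied {
		if stalePolicyIDs[id] {
			if err := dp.RemovePolicy(endpointID, id); err != nil {
				return err
			}
		}
	}
	return nil
}
```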

huntergregory avatar Aug 01 '22 21:08 huntergregory

The current design leads to a ~65% increase in controller workqueue updates. The design should change, since this may have a significant memory impact.

huntergregory avatar Aug 22 '22 19:08 huntergregory

> The current design leads to a ~65% increase in controller workqueue updates. The design should change, since this may have a significant memory impact.

When enqueuing empty-IP updates only for Pods in Running status, there are just 4 update-with-empty-ip events for 736 regular update events (in a Windows conformance run).
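A minimal sketch of that gating; the function name is illustrative, while corev1.PodRunning is the real Kubernetes API constant:

```go
package npm

import corev1 "k8s.io/api/core/v1"

// shouldEnqueueEmptyIPUpdate sketches the mitigation described above:
// only enqueue update-with-empty-ip events for Pods in Running phase,
// which kept the extra events to 4 out of 736 in the conformance run.
func shouldEnqueueEmptyIPUpdate(pod *corev1.Pod) bool {
	return pod.Status.PodIP == "" && pod.Status.Phase == corev1.PodRunning
}
```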

huntergregory avatar Aug 24 '22 18:08 huntergregory

This pull request is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.

github-actions[bot] avatar Dec 06 '22 00:12 github-actions[bot]

There is no performance impact for pure Linux clusters.

As discussed above, there is hardly any impact for clusters with Windows Server 2022:

> There are only 4 update-with-empty-ip events for 736 regular update events

Experiments

Steps:

  1. Create deployment with 1 replica.
  2. Scale to 2k replicas.
  3. After a while, delete all 2k Pods, which causes 2k replacement Pods to be created.

Experiment 1: Uptime SLA for API Server

  • Cluster: az aks create -g $rg -n $cluster --network-plugin azure --max-pods 250 -c 16 --uptime-sla
  • Pod Image: k8s.gcr.io/pause:3.2
    • This Pod has essentially no memory/CPU overhead.

Results

There were 2 update-with-empty-ip events.

npm_controller_pod_event_total{operation="create"} 9
npm_controller_pod_event_total{operation="update"} 16025
npm_controller_pod_event_total{operation="update-with-empty-ip"} 2
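For context, a counter like npm_controller_pod_event_total could be defined with the Prometheus client_golang library roughly as below; this is a hedged sketch, not NPM's actual metrics code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// podEventTotal sketches how a counter such as
// npm_controller_pod_event_total might be defined.
var podEventTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "npm_controller_pod_event_total",
		Help: "Pod events handled by the NPM controller, by operation.",
	},
	[]string{"operation"},
)

func init() {
	prometheus.MustRegister(podEventTotal)
}

// recordPodEvent increments the counter for an operation such as
// "create", "update", or "update-with-empty-ip".
func recordPodEvent(operation string) {
	podEventTotal.WithLabelValues(operation).Inc()
}
```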

Experiment 2: No Uptime SLA and Heavier Pod Image

  • Cluster downgraded: az aks update -g $rg -n $cluster --no-uptime-sla
  • Pod Image: k8s.gcr.io/e2e-test-images/agnhost:2.33
    • Command: /agnhost serve-hostname --tcp --http=false --port "80"

Results

There were no update-with-empty-ip events.

npm_controller_pod_event_total{operation="create"} 10
npm_controller_pod_event_total{operation="update"} 21391

huntergregory avatar Dec 16 '22 22:12 huntergregory

/azp run

huntergregory avatar Jan 03 '23 20:01 huntergregory

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jan 03 '23 20:01 azure-pipelines[bot]

Current test runs are successful, excluding:

  1. a flake in conformance stress (verified unrelated to the change)
  2. HNS-related errors in Windows cyclonus and conformance runs

These were verified as unrelated to the change by searching for "warning: ADD POD", a log line that would be hit in the control flow related to this change.

huntergregory avatar Feb 14 '23 18:02 huntergregory