
fix(linkerd-cni): prevent parent from outpacing forked `monitor_cni_config()` process

Open sdickhoven opened this issue 2 months ago • 2 comments

this pr addresses a race condition in which the parent process outpaces the forked monitor_cni_config() process.

this race condition was observed in a production cluster and can clearly be seen in the log output:

[screenshot of the log output, taken 2025-10-14 11:38:17]

as can be seen above, the inotifywait is about 140ms late to the party.

the result is that existing cni config files are not patched, which in turn causes the repair controller to incessantly "murder" pods that start up on the doomed worker node.

i should add that we've been running linkerd-cni v1.6.3 (which includes the fix for previously encountered race conditions) since 16 jul 2025 and v1.6.4 since 15 sep 2025 without a problem. yesterday was the first time we've seen this (new) race condition.
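the general shape of the race can be illustrated with a small, self-contained sketch. this is a simplified stand-in and not the actual install-cni.sh logic: the "watcher" subshell, the `run_case` helper, the temp directories and the file name `10-example.conf` are all hypothetical, and no inotifywait is involved. the watcher only notices files that appear after it has finished starting up, so a parent that acts immediately is missed, while a parent that waits for a readiness marker is seen.

```bash
#!/usr/bin/env bash
# hypothetical sketch of a parent outpacing a forked watcher.
# nothing here is taken from install-cni.sh.

set -e

run_case() {
  local wait_for_ready=$1
  local dir state before after
  dir=$(mktemp -d)    # directory being "watched"
  state=$(mktemp -d)  # scratch space for the readiness marker and result

  (
    sleep 0.3                     # simulated watcher startup delay
    before=$(ls "$dir" | wc -l)   # files already present produce no event
    touch "$state/ready"          # watcher is now ready
    sleep 0.5                     # observation window
    after=$(ls "$dir" | wc -l)
    if [ "$after" -gt "$before" ]; then
      echo seen > "$state/result"
    else
      echo missed > "$state/result"
    fi
  ) &

  if [ "$wait_for_ready" = yes ]; then
    # the fix: do not proceed until the watcher says it is ready
    until [ -e "$state/ready" ]; do sleep 0.05; done
  fi

  touch "$dir/10-example.conf"    # the parent's action
  wait                            # let the watcher finish
  cat "$state/result"
  rm -rf "$dir" "$state"
}

without_wait=$(run_case no)
with_wait=$(run_case yes)

echo "no readiness check: $without_wait"
echo "readiness check:    $with_wait"
```

the sketch uses a marker file for readiness only because it keeps the example dependency-free. the pr itself checks the state of the forked inotifywait process instead.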

sdickhoven, Oct 15 '25 18:10

i tested the logic with the following script:

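#!/usr/bin/env bash

set -e

# decoy: uncomment to check that an unrelated inotifywait is ignored
#inotifywait -m /tmp &

# forked function whose inotifywait starts 2s late
myfunc() {
  sleep 2
  inotifywait -m .
}

myfunc &
monitor_pid=$!

# poll until an inotifywait child of the forked function is in state S
while true; do
  monitor_state=$(
    (ps --ppid="$monitor_pid" -o comm=,state= || true) |
    awk '$1 == "inotifywait" && $2 == "S" {print "ok"}'
  )
  echo "tick $monitor_state"
  [ -z "$monitor_state" ] || break
  sleep .1 # 100ms
done

echo done

# block until the background job exits
wait -n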

i also threw in a second inotifywait (commented out above) to verify that only the inotifywait i care about is looked at by my logic.

it is also possible to see the new logic in action by inserting a sleep 2 just before this line (which i also did during testing):

https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L295
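a bounded variant of the polling loop from the test script could look like the sketch below. it is only an illustration under stated assumptions: `wait_for_child` and `fake_monitor` are hypothetical names, `tail -f /dev/null` stands in for inotifywait so the example runs without inotify-tools, and it assumes a procps-style `ps` (with `--ppid`) and `pkill`. the retry limit is an addition for the example and is not something the pr is known to include.

```bash
#!/usr/bin/env bash
# hypothetical sketch: wait (with a retry limit) for a forked function's
# child process to reach state S. assumes procps ps and pkill.

set -e

# poll until a child of pid $1 with command name $2 is in state S,
# giving up after $3 attempts spaced 100ms apart
wait_for_child() {
  local parent_pid=$1 name=$2 tries=$3 state
  while [ "$tries" -gt 0 ]; do
    state=$(
      (ps --ppid="$parent_pid" -o comm=,state= || true) |
      awk -v name="$name" '$1 == name && $2 == "S" {print "ok"}'
    )
    [ -z "$state" ] || return 0
    tries=$((tries - 1))
    sleep .1
  done
  return 1
}

# stand-in for the forked monitor: tail plays the role of inotifywait
fake_monitor() {
  sleep 0.3           # simulated startup delay
  tail -f /dev/null   # long-running child we wait for
}

fake_monitor &
monitor_pid=$!

if wait_for_child "$monitor_pid" tail 50; then
  result=ready
else
  result=timeout
fi
echo "$result"

# clean up the stand-in child and reap the background job
pkill -P "$monitor_pid" || true
wait "$monitor_pid" || true
```

matching on both the parent pid and the command name is what keeps an unrelated process with the same name from satisfying the check, the same property the commented-out second inotifywait was used to verify.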

sdickhoven, Oct 15 '25 18:10

Thanks for the very detailed overview. We should be able to take a deeper look at this next week.

olix0r, Oct 17 '25 12:10

We encountered this issue recently too after upgrading from edge-2024.11.8 to edge-2025.8.5, on EKS 1.32 using the AWS VPC CNI. Thanks for the patch! I've built a custom linkerd-cni image with this PR's install-cni.sh script and we're using it successfully so far.

ericluria, Nov 05 '25 23:11