linkerd2-proxy-init fix(linkerd-cni): prevent parent from outpacing forked `monitor_cni_config()` process

this pr addresses a race condition in which the parent process outpaces the forked monitor_cni_config() process.

this race condition was observed in a production cluster and can clearly be seen in the log output:

as can be seen above, the inotifywait is about 140ms late to the party.

the result is that existing cni config files are not patched, which then causes the repair controller to incessantly "murder" pods that start up on the doomed worker node.

i should add that we've been running linkerd-cni v1.6.3 (which includes the fix for previously encountered race conditions) since 16 jul 2025 and v1.6.4 since 15 sep 2025 without a problem. yesterday was the first time we've seen this (new) race condition.

Oct 15 '25 18:10 sdickhoven

i tested the logic with the following script:

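#!/usr/bin/env bash

set -e

# decoy watcher: uncomment to confirm that unrelated inotifywait processes are ignored
#inotifywait -m /tmp &

# stand-in for monitor_cni_config(): its inotifywait starts 2s late
myfunc() {
  sleep 2
  inotifywait -m .
}

myfunc &
monitor_pid=$!

# poll until the forked function has an inotifywait child in state S (sleeping, i.e. waiting for events)
while true; do
  monitor_state=$(
    (ps --ppid="$monitor_pid" -o comm=,state= || true) |
    awk '$1 == "inotifywait" && $2 == "S" {print "ok"}'
  )
  echo "tick $monitor_state"
  [ -z "$monitor_state" ] || break
  sleep .1 # 100ms
done

echo done

wait -n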

i also threw in a second inotifywait (commented out above) to verify that only the inotifywait i care about is looked at by my logic.

it is also possible to see the new logic in action by inserting a sleep 2 just before this line (which i also did during testing):

https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L295
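
the polling loop from the test script above could also be wrapped in a small helper with a retry limit, so the parent does not spin forever if the watcher never comes up. this is only a minimal sketch, not the code in this pr: `wait_for_child`, `slow_monitor`, and the default of 50 tries are hypothetical names and values, and `tail -f /dev/null` is used as a stand-in for `inotifywait` so the example runs without inotify-tools. it assumes a procps `ps` that supports `--ppid`.

```shell
#!/usr/bin/env bash

# hypothetical helper (not part of install-cni.sh):
# block until the process with pid $1 has a direct child whose command name
# is $2 and whose state is S (sleeping, i.e. blocked waiting for events).
# gives up after $3 polls (default 50, roughly 5s) and returns 1.
wait_for_child() {
  local parent_pid=$1 child_comm=$2 tries=${3:-50} state
  while [ "$tries" -gt 0 ]; do
    state=$(
      # ps exits non-zero when nothing matches; treat that as "not yet"
      (ps --ppid="$parent_pid" -o comm=,state= || true) |
      awk -v c="$child_comm" '$1 == c && $2 == "S" {print "ok"}'
    )
    [ -z "$state" ] || return 0
    sleep 0.1
    tries=$((tries - 1))
  done
  return 1
}

# stand-in for monitor_cni_config(): starts late, then blocks in a child.
# tail -f /dev/null replaces inotifywait here (assumption for the demo).
slow_monitor() {
  sleep 0.5
  tail -f /dev/null
}

slow_monitor &
monitor_pid=$!

if wait_for_child "$monitor_pid" tail; then
  echo "watcher is up; safe to continue"
else
  echo "watcher did not start in time" >&2
fi

# demo cleanup only: stop the stand-in watcher and its parent subshell
pkill -P "$monitor_pid" || true
kill "$monitor_pid" 2>/dev/null || true
```

matching on the parent pid is what keeps an unrelated process with the same command name from satisfying the check, which is the same property the commented-out decoy in the test script verifies.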

Oct 15 '25 18:10 sdickhoven

Thanks for the very detailed overview. We should be able to take a deeper look at this next week.

Oct 17 '25 12:10 olix0r

We encountered this issue recently too after upgrading from edge-2024.11.8 to edge-2025.8.5, on EKS 1.32 using the AWS VPC CNI. Thanks for the patch! I've built a custom linkerd-cni image with this PR's install-cni.sh script and we're using it successfully so far.

Nov 05 '25 23:11 ericluria
