fix(linkerd-cni): prevent parent from outpacing forked `monitor_cni_config()` process
this pr addresses a race condition in which the parent process outpaces the forked `monitor_cni_config()` process.
this race condition was observed in a production cluster and can clearly be seen in the log output:
as can be seen above, the inotifywait is about 140ms late to the party.
the result is that existing cni config files are not patched, which then causes the repair controller to incessantly "murder" pods that start up on the doomed worker node.
i should add that we've been running linkerd-cni v1.6.3 (which includes the fix for previously encountered race conditions) since 16 jul 2025 and v1.6.4 since 15 sep 2025 without a problem. yesterday was the first time we've seen this (new) race condition.
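the fix is essentially a readiness gate: after the monitor is forked, the parent polls until the monitor's inotifywait child is actually up and waiting before it continues. a rough sketch of how that gate slots in after the fork (the exact wiring and surrounding code in install-cni.sh differ; variable names here just mirror the test script below):

monitor_cni_config &   # forked monitor that sets up the inotifywait watch
monitor_pid=$!

# block until the monitor's inotifywait child exists and is parked in its watch
# (state "S"), so the parent cannot race ahead and skip patching existing configs
while true; do
  monitor_state=$(
    (ps --ppid="$monitor_pid" -o comm=,state= || true) |
      awk '$1 == "inotifywait" && $2 == "S" {print "ok"}'
  )
  [ -z "$monitor_state" ] || break
  sleep .1 # 100ms
done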
i tested the logic with the following script:
#!/usr/bin/env bash
set -e

#inotifywait -m /tmp &

myfunc() {
  sleep 2
  inotifywait -m .
}

myfunc &
monitor_pid=$!

while true; do
  monitor_state=$(
    (ps --ppid=$monitor_pid -o comm=,state= || true) |
      awk '$1 == "inotifywait" && $2 == "S" {print "ok"}'
  )
  echo "tick $monitor_state"
  [ -z "$monitor_state" ] || break
  sleep .1 # 100ms
done

echo done
wait -n
i also threw in a second inotifywait (commented out above) to verify that my logic only looks at the inotifywait i care about.
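that works because the filter only considers direct children of the forked monitor; a stray inotifywait started elsewhere (like the commented-out one, whose parent is the top-level script) can never match. illustrative output of the ps call while the gate is polling (assumed/abridged; exact column widths depend on the ps build):

ps --ppid="$monitor_pid" -o comm=,state=
# while myfunc is still in its "sleep 2":
#   sleep           S
# once the watch is established:
#   inotifywait     S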
it is also possible to see the new logic in action by inserting a `sleep 2` just before this line (which i also did during testing):
https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L295
Thanks for the very detailed overview. We should be able to take a deeper look at this next week.
We encountered this issue recently too after upgrading from edge-2024.11.8 to edge-2025.8.5, on EKS 1.32 using the AWS VPC CNI. Thanks for the patch! I've built a custom linkerd-cni image with this PR's install-cni.sh script and we're using it successfully so far.