Jason Aliyetti
Some additional detail: We run on EKS using custom cni networking. We can confirm that we're frequently seeing Cilium hang on indefinitely to identities for pods that no longer exist....
Another datapoint: we found a correlation between this occurring and agents that seem to constantly have all endpoints in either the regenerating or waiting_to_regenerate state. Agents seem to occasionally get...
We upgraded to 1.12.2 last week and this occurred again last night. When we try to collect a sysdump from impacted agents we get a timeout error -- Cilium API...
@pchaigno no, because we don't have a consistent reproduction for this at this point
Is it possible this'd be exacerbated by something like https://github.com/aws/containers-roadmap/issues/1810 ?
[logs.txt](https://github.com/cilium/cilium/files/9666739/logs.txt) [cilium-status.txt](https://github.com/cilium/cilium/files/9666777/cilium-status.txt) I've tried to cull some data from the agent logs and status output here. Some of it's been redacted and I've commented on that in the file where...
I was able to get a stack dump when the issue recurred by running cilium-bugtool on the pod. [gops.zip](https://github.com/cilium/cilium/files/9680331/gops.zip) If I grep the stack for "lockAlive" and sort them it...
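For anyone repeating the grep-and-sort step on their own dump, here is a minimal Python sketch of one way to do it. It assumes the dump is in the standard Go goroutine text format (a `goroutine N [state]:` header followed by alternating function and file lines, with blank lines between goroutines). The helper name `count_lock_alive_callers` is hypothetical, not part of Cilium or gops.

```python
from collections import Counter


def count_lock_alive_callers(dump_text, needle="lockAlive"):
    """Group goroutines blocked under `needle` by the function that called it.

    Assumes the standard Go goroutine dump layout: a header line, then
    alternating function-call lines and file:line lines.
    """
    counts = Counter()
    for block in dump_text.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines or not lines[0].startswith("goroutine "):
            continue
        # Function-call lines sit at odd offsets after the header.
        funcs = lines[1::2]
        for i, fn in enumerate(funcs):
            if needle in fn:
                caller = funcs[i + 1] if i + 1 < len(funcs) else "<unknown>"
                # Drop the argument list so identical callers group together.
                counts[caller.rsplit("(", 1)[0]] += 1
                break
    # Most frequent callers first.
    return counts.most_common()
```

Running it over the text of a dump (for example, a file extracted from the gops archive) gives a list of `(caller, count)` pairs, which makes it easier to see where most goroutines are waiting on the endpoint lock.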
@aanm apologies for tagging you directly, but is the above information helpful? We still get hit by this weekly across our clusters.
This might get fixed with https://github.com/cilium/cilium/pull/21629.
@joestringer any idea when 1.12.3 and a 1.11 backport would be available? We're also dealing with https://github.com/cilium/cilium/issues/20915 and aren't sure if this would help that problem or not and are...