fdb-kubernetes-operator
Operator attempting to connect to non-fdb pods on TCP 4501 - expected?
Hey guys!
We're using NetworkPolicies to restrict which pods our pods can talk to. I have a policy permitting fdb-operator to talk to the fdb pods on TCP 4501 and 8080 (roughly sketched after the output below), but I'm seeing attempts from the operator to talk to virtually every other pod in the namespace as well (these attempts are blocked, but noisy):
root@cowboy:~# kubectl describe pod -n test-mailhog-gbfkpf | grep 4501
Warning PacketDrop 28s (x3 over 4m38s) kube-iptables-tailer Packet dropped when receiving traffic from test-mailhog-gbfkpf (10.114.205.39) on port 4501/TCP
Warning PacketDrop 27s (x3 over 4m37s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 29s (x3 over 4m39s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 33s (x3 over 4m43s) kube-iptables-tailer Packet dropped when sending traffic to 10.114.205.160 on port 4501/TCP
Warning PacketDrop 28s (x3 over 4m38s) kube-iptables-tailer Packet dropped when sending traffic to 10.112.219.199 on port 4501/TCP
Warning PacketDrop 28s (x3 over 4m38s) kube-iptables-tailer Packet dropped when sending traffic to 10.127.67.192 on port 4501/TCP
Warning PacketDrop 28s (x3 over 4m38s) kube-iptables-tailer Packet dropped when sending traffic to 10.125.137.134 on port 4501/TCP
Warning PacketDrop 29s (x3 over 4m39s) kube-iptables-tailer Packet dropped when sending traffic to 10.112.219.169 on port 4501/TCP
Warning PacketDrop 29s (x3 over 4m39s) kube-iptables-tailer Packet dropped when sending traffic to test-mailhog-gbfkpf (10.127.220.253) on port 4501/TCP
Warning PacketDrop 30s (x3 over 4m40s) kube-iptables-tailer Packet dropped when sending traffic to dev (10.112.219.173) on port 4501/TCP
Warning PacketDrop 30s (x3 over 4m40s) kube-iptables-tailer Packet dropped when sending traffic to preview-mailhog-pr-143 (10.127.220.203) on port 4501/TCP
Warning PacketDrop 30s (x3 over 4m40s) kube-iptables-tailer Packet dropped when sending traffic to 10.112.219.190 on port 4501/TCP
Warning PacketDrop 30s (x3 over 4m40s) kube-iptables-tailer Packet dropped when sending traffic to prod (10.114.205.63) on port 4501/TCP
Warning PacketDrop 33s (x3 over 4m43s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 29s (x3 over 4m39s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 32s (x3 over 4m42s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 33s (x3 over 4m43s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 33s (x3 over 4m43s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 28s (x3 over 4m38s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 32s (x3 over 4m42s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 33s (x3 over 4m43s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 30s (x3 over 4m40s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 34s (x3 over 4m44s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 31s (x3 over 4m41s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 33s (x3 over 4m43s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 29s (x3 over 4m39s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 34s (x3 over 4m44s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 34s (x3 over 4m44s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 29s (x3 over 4m39s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 31s (x3 over 4m41s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
Warning PacketDrop 35s (x3 over 4m45s) kube-iptables-tailer Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
root@cowboy:~#
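For reference, the policy permitting that traffic looks roughly like the sketch below. The labels and selectors here are placeholders rather than our exact manifests, so treat it as illustrative only:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-fdb-operator-to-fdb
  namespace: preview-database-pr-189          # namespace holding the FDB pods
spec:
  podSelector:
    matchLabels:
      fdb-cluster-name: retort-fdb            # placeholder label for the FDB pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: fdb-operator              # placeholder label for the operator's namespace
      ports:
        - protocol: TCP
          port: 4501                          # process port used by the FDB pods
        - protocol: TCP
          port: 8080                          # sidecar port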
Is this expected, or due to some sort of FDB "discovery" process?
Thanks! D
The operator only tries to connect on 4501, right? That suggests this is related to fdbcli or fdb client behaviour. You might also ask this question in https://forums.foundationdb.org. Are all these Pods clients of a FoundationDB cluster?
I would only expect the operator to talk to FDB pods. However, if pods get rescheduled with different IPs, there could be components in the database that continue trying to connect to the old IPs.
I've made some progress debugging this.
In our environment, we spin up/down an FDB cluster every time we test a commit. It seems that fdb-operator is still trying to talk to the IPs that were used by fdb pods which have since been destroyed. These IPs then get reused in another test environment (potentially for entirely unrelated pods) and trigger network policy drops.
Then I debugged further: it doesn't seem as if the operator "notices" (or at least logs) when a foundationdbcluster resource is deleted. Here are the latest logs from the operator:
{"level":"info","ts":1628115449.006371,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.ExcludeInstances"}
{"level":"info","ts":1628115449.0065224,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.ChangeCoordinators"}
{"level":"info","ts":1628115449.0065928,"logger":"fdbclient","msg":"Running command","namespace":"preview-database-pr-189","cluster":"retort-fdb","path":"/usr/bin/fdb/6.3/fdbcli","args":["/usr/bin/fdb/6.3/fdbcli","--exec","status minimal","-C","/tmp/847471480","--log","--trace_format","xml","--timeout","10","--log-dir","/var/log/fdb"]}
{"level":"info","ts":1628115449.1243062,"logger":"fdbclient","msg":"Command completed","namespace":"preview-database-pr-189","cluster":"retort-fdb","output":"The database is avai..."}
{"level":"info","ts":1628115449.124418,"logger":"fdbclient","msg":"Fetch status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.1551352,"logger":"fdbclient","msg":"Successfully fetched status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.1554048,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.BounceProcesses"}
{"level":"info","ts":1628115449.1555007,"logger":"fdbclient","msg":"Fetch status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.1832068,"logger":"fdbclient","msg":"Successfully fetched status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.1833994,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.UpdatePods"}
{"level":"info","ts":1628115449.184687,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.RemoveServices"}
{"level":"info","ts":1628115449.1847312,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.RemoveProcessGroups"}
{"level":"info","ts":1628115449.1848888,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.UpdateStatus"}
{"level":"info","ts":1628115449.1849866,"logger":"fdbclient","msg":"Fetch status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.214518,"logger":"fdbclient","msg":"Successfully fetched status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.2692754,"logger":"controller","msg":"Reconciliation complete","namespace":"preview-database-pr-189","cluster":"retort-fdb","generation":1}
I deleted the resource about 5 min ago:
root@cowboy:~# kubectl delete foundationdbclusters.apps.foundationdb.org -n preview-database-pr-189 retort-fdb
foundationdbcluster.apps.foundationdb.org "retort-fdb" deleted
root@cowboy:~#
And the operator hasn't logged anything about it.
Presumably this is causing the fdbcli process to continue trying to connect to the IPs originally used in this cluster.
So my conclusion is that the errors I'm seeing occur because the operator doesn't react when a CR is deleted.
Cheers! D
We used to have code in the operator that reacted to a CR being deleted and cleaned up the cache of database connections, but we removed that cache because it proved too brittle. Maybe what you're observing is that the operator keeps connections open to the database and doesn't close them when the CR is deleted.
Mmm.. given we could spin up/down 50-100 clusters per day, this could be a problem :)
What can I do to avoid this? One obvious idea is to deploy an instance of the operator into each namespace as I deploy its fdb cluster, and disable global mode. (IIRC this caused problems last time I tried, because multiple operators would clash on creating CRDs.)
Any other suggestions? :)
Thanks! D
Deploying an instance of the operator in each namespace should work. If you're going that route, you would want to separate out the deployment of the CRD, since you only need to deploy that once per Kubernetes cluster (a rough sketch of what this could look like is further below). Another option is to delete the operator pod, which should cause it to get recreated without any of the old connections.
Ideally we should address this in the operator, by closing the connections automatically, but I'm not sure what that would entail.
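For what it's worth, a per-namespace deployment could look roughly like this sketch. The image tag, service account, labels, and the WATCH_NAMESPACE handling are assumptions modelled on the repo's sample manifests, so verify them against the operator version you're actually running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fdb-kubernetes-operator-controller-manager
  namespace: preview-database-pr-189                                   # one operator per test namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fdb-kubernetes-operator-controller-manager
  template:
    metadata:
      labels:
        app: fdb-kubernetes-operator-controller-manager
    spec:
      serviceAccountName: fdb-kubernetes-operator-controller-manager   # assumed to match your RBAC setup
      containers:
        - name: manager
          image: foundationdb/fdb-kubernetes-operator:<your-version>   # placeholder tag
          env:
            # Assumption: the operator scopes itself to a single namespace via
            # WATCH_NAMESPACE, as in the sample manifests; check this for your version.
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

The CRDs (foundationdbclusters.apps.foundationdb.org and friends) would then be applied once per Kubernetes cluster, outside the per-namespace deployments, so the operators don't fight over them.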
@funkypenguin are you still seeing this issue with a recent version? The operator no longer uses the caching mechanism and directly closes the db connections once it's done fetching the data, so I would assume this issue is resolved.
Oooh, that could be useful! I'll test and let you know...
I'm going to close this issue. Feel free to reopen this issue if you still see this behaviour.