
Operator attempting to connect to non-fdb pods on TCP 4501 - expected?

Open funkypenguin opened this issue 3 years ago • 9 comments

Hey guys!

We're using networkPolicies to restrict which pods our pods can talk to. I have a policy permitting fdb-operator to talk to the fdb pods on TCP 4501 and 8080 (roughly as sketched after the output below), but I'm seeing attempts from the operator to talk to virtually every other pod in the namespace as well (these attempts are blocked, but noisy):

root@cowboy:~# kubectl describe pod -n test-mailhog-gbfkpf | grep 4501
  Warning  PacketDrop              28s (x3 over 4m38s)  kube-iptables-tailer     Packet dropped when receiving traffic from test-mailhog-gbfkpf (10.114.205.39) on port 4501/TCP
  Warning  PacketDrop              27s (x3 over 4m37s)  kube-iptables-tailer     Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              29s (x3 over 4m39s)  kube-iptables-tailer     Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              33s (x3 over 4m43s)  kube-iptables-tailer     Packet dropped when sending traffic to 10.114.205.160 on port 4501/TCP
  Warning  PacketDrop              28s (x3 over 4m38s)  kube-iptables-tailer     Packet dropped when sending traffic to 10.112.219.199 on port 4501/TCP
  Warning  PacketDrop              28s (x3 over 4m38s)  kube-iptables-tailer     Packet dropped when sending traffic to 10.127.67.192 on port 4501/TCP
  Warning  PacketDrop              28s (x3 over 4m38s)  kube-iptables-tailer     Packet dropped when sending traffic to 10.125.137.134 on port 4501/TCP
  Warning  PacketDrop              29s (x3 over 4m39s)  kube-iptables-tailer     Packet dropped when sending traffic to 10.112.219.169 on port 4501/TCP
  Warning  PacketDrop              29s (x3 over 4m39s)  kube-iptables-tailer     Packet dropped when sending traffic to test-mailhog-gbfkpf (10.127.220.253) on port 4501/TCP
  Warning  PacketDrop              30s (x3 over 4m40s)  kube-iptables-tailer     Packet dropped when sending traffic to dev (10.112.219.173) on port 4501/TCP
  Warning  PacketDrop              30s (x3 over 4m40s)  kube-iptables-tailer     Packet dropped when sending traffic to preview-mailhog-pr-143 (10.127.220.203) on port 4501/TCP
  Warning  PacketDrop              30s (x3 over 4m40s)  kube-iptables-tailer     Packet dropped when sending traffic to 10.112.219.190 on port 4501/TCP
  Warning  PacketDrop              30s (x3 over 4m40s)  kube-iptables-tailer     Packet dropped when sending traffic to prod (10.114.205.63) on port 4501/TCP
  Warning  PacketDrop              33s (x3 over 4m43s)  kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              29s (x3 over 4m39s)    kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop  32s (x3 over 4m42s)   kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              33s (x3 over 4m43s)  kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              33s (x3 over 4m43s)  kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              28s (x3 over 4m38s)  kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop  32s (x3 over 4m42s)  kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              33s (x3 over 4m43s)  kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              30s (x3 over 4m40s)    kube-iptables-tailer     Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              34s (x3 over 4m44s)    kube-iptables-tailer     Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              31s (x3 over 4m41s)    kube-iptables-tailer     Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              33s (x3 over 4m43s)    kube-iptables-tailer     Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              29s (x3 over 4m39s)   kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              34s (x3 over 4m44s)    kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              34s (x3 over 4m44s)    kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              29s (x3 over 4m39s)    kube-iptables-tailer     Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              31s (x3 over 4m41s)  kube-iptables-tailer     Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
  Warning  PacketDrop              35s (x3 over 4m45s)  kube-iptables-tailer  Packet dropped when receiving traffic from fdb-operator (10.125.137.1) on port 4501/TCP
root@cowboy:~#
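
For reference, the policy is roughly along these lines (a minimal sketch, not our exact manifest; the namespace, names, and label selectors are placeholders):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-operator-to-fdb
  namespace: preview-database-pr-189   # placeholder: the namespace the FDB cluster runs in
spec:
  podSelector:
    matchLabels:
      app: fdb                         # placeholder: whatever labels your FDB pods carry
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: fdb-operator       # placeholder: label on the operator's namespace
          podSelector:
            matchLabels:
              app: fdb-operator        # placeholder: label on the operator pod
      ports:
        - protocol: TCP
          port: 4501
        - protocol: TCP
          port: 8080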

Is this expected, or due to some sort of FDB "discovery" process?

Thanks! D

funkypenguin avatar Jul 16 '21 22:07 funkypenguin

The operator only tries to connect to 4501, right? That makes it seem like this is related to fdbcli or fdb client behaviour. You might also ask this question in https://forums.foundationdb.org. Are all these Pods clients of a FoundationDB cluster?

johscheuer avatar Jul 19 '21 06:07 johscheuer

I would only expect the operator to talk to FDB pods. However, if pods get rescheduled with different IPs, there could be components in the database that continue to try connecting to the old IPs.

brownleej avatar Jul 20 '21 15:07 brownleej

I've made some progress debugging this.

In our environment, we spin up/down an FDB cluster every time we test a commit. It seems that fdb-operator is still trying to talk to the IPs that were used by fdb pods which have since been destroyed. These IPs then get reused in another test environment (potentially for entirely unrelated pods), and trigger network policy drops.

Then I debugged further. It doesn't seem as if the operator "notices" (or at least logs) when a foundationdbcluster resource is deleted. Here are the latest logs from the operator:

{"level":"info","ts":1628115449.006371,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.ExcludeInstances"}
{"level":"info","ts":1628115449.0065224,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.ChangeCoordinators"}
{"level":"info","ts":1628115449.0065928,"logger":"fdbclient","msg":"Running command","namespace":"preview-database-pr-189","cluster":"retort-fdb","path":"/usr/bin/fdb/6.3/fdbcli","args":["/usr/bin/fdb/6.3/fdbcli","--exec","status minimal","-C","/tmp/847471480","--log","--trace_format","xml","--timeout","10","--log-dir","/var/log/fdb"]}
{"level":"info","ts":1628115449.1243062,"logger":"fdbclient","msg":"Command completed","namespace":"preview-database-pr-189","cluster":"retort-fdb","output":"The database is avai..."}
{"level":"info","ts":1628115449.124418,"logger":"fdbclient","msg":"Fetch status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.1551352,"logger":"fdbclient","msg":"Successfully fetched status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.1554048,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.BounceProcesses"}
{"level":"info","ts":1628115449.1555007,"logger":"fdbclient","msg":"Fetch status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.1832068,"logger":"fdbclient","msg":"Successfully fetched status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.1833994,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.UpdatePods"}
{"level":"info","ts":1628115449.184687,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.RemoveServices"}
{"level":"info","ts":1628115449.1847312,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.RemoveProcessGroups"}
{"level":"info","ts":1628115449.1848888,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"preview-database-pr-189","cluster":"retort-fdb","subReconciler":"controllers.UpdateStatus"}
{"level":"info","ts":1628115449.1849866,"logger":"fdbclient","msg":"Fetch status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.214518,"logger":"fdbclient","msg":"Successfully fetched status from FDB","namespace":"preview-database-pr-189","cluster":"retort-fdb"}
{"level":"info","ts":1628115449.2692754,"logger":"controller","msg":"Reconciliation complete","namespace":"preview-database-pr-189","cluster":"retort-fdb","generation":1}

I deleted the resource about 5 min ago:

root@cowboy:~# kubectl delete foundationdbclusters.apps.foundationdb.org -n preview-database-pr-189 retort-fdb
foundationdbcluster.apps.foundationdb.org "retort-fdb" deleted
root@cowboy:~#

And the operator hasn't logged anything about it.

Presumably this is causing the fdbcli process to continue trying to connect to the IPs originally used in this cluster.

So my conclusion is that the errors I'm seeing are because the operator doesn't react when a CR is deleted.
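
If it helps confirm this, one thing I could try (a sketch; it assumes the operator container has a shell and ss available, which it may not, and "fdb-operator" as the deployment and namespace name is a placeholder for our setup) is to list the operator's open connections on 4501:

kubectl exec -n fdb-operator deploy/fdb-operator -- sh -c "ss -tn | grep 4501"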

Cheers! D

funkypenguin avatar Aug 04 '21 22:08 funkypenguin

We used to have code in the operator to react to a CR being deleted and clear up the cache of the database connections, but we removed that cache because it proved to be too brittle. Maybe what you're observing is that the operator is keeping connections open to the database, and doesn't close them when the CR is deleted.

brownleej avatar Aug 05 '21 16:08 brownleej

Mmm.. given we could spin up/down 50-100 clusters per day, this could be a problem :)

What can I do to avoid this? One obvious idea is to deploy an instance of the operator into each namespace as I deploy its fdb cluster, and disable global mode. (IIRC this caused problems last time I tried, because multiple operators would clash on creating CRDs.)

Any other suggestions? :)

Thanks! D

funkypenguin avatar Aug 05 '21 21:08 funkypenguin

Deploying an instance of the operator in each namespace should work. If you're going that route, you would want to separate out the deployment of the CRD, since you only need to deploy that once per Kubernetes cluster. Another option is to delete the operator pod, which should cause it to get recreated without any of the old connections.
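
Roughly something like this (a sketch; the CRD path assumes a checkout of the operator repo with the standard kubebuilder layout, and the namespace/label are placeholders for however you deploy the operator):

# install the CRDs once per Kubernetes cluster
kubectl apply -f config/crd/bases/

# recreate the operator pod to drop any stale connections immediately
kubectl delete pod -n fdb-operator -l app=fdb-operator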

brownleej avatar Aug 06 '21 15:08 brownleej

Ideally we should address this in the operator, by closing the connections automatically, but I'm not sure what that would entail.

brownleej avatar Aug 06 '21 15:08 brownleej

@funkypenguin are you still seeing this issue with a recent version? The operator no longer uses the caching mechanism and directly closes the db connections once it's done fetching the data, so I would assume that this issue is resolved.

johscheuer avatar Jul 02 '22 08:07 johscheuer

Oooh, that could be useful! I'll test and let you know...

funkypenguin avatar Jul 03 '22 21:07 funkypenguin

I'm going to close this issue. Feel free to reopen it if you still see this behaviour.

johscheuer avatar Jan 19 '23 07:01 johscheuer