fdb-kubernetes-operator
Deferring reconciliation due to unsupported clients (coordinators)
What happened?
I run a cluster with separate coordinator pods that uses an older FoundationDB version, e.g. 7.1.17. Once I set the version to the latest, 7.3.27, the operator never performs the upgrade because of this error:
{"level":"info","ts":"2024-04-16T06:22:20Z","logger":"controller","msg":"Reconciliation terminated early","namespace":"skunkworks","cluster":"mycluster","reconciler":"controllers.checkClientCompatibility","requeueAfter":60,"message":"7 clients do not support version 7.3.27: 192.168.149.171:49018, 192.168.149.171:52564, 192.168.159.185:52772, 192.168.159.185:55498, 192.168.190.37:36264, 192.168.190.37:42684, 192.168.190.37:52468"}
However those IPs correspond to the pod IPs of 3 of the coordinators (!), not an external client (from cluster PoV).
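To check which hosts are being flagged, the addresses can be pulled out of the log message and compared with the pod IPs. This is a minimal sketch, assuming the message format shown in the log line above; the helper name `unsupported_client_ips` is made up for illustration.

```python
import json
import re

# The operator log line from above, reduced to the fields used here.
LOG_LINE = json.dumps({
    "reconciler": "controllers.checkClientCompatibility",
    "message": (
        "7 clients do not support version 7.3.27: "
        "192.168.149.171:49018, 192.168.149.171:52564, "
        "192.168.159.185:52772, 192.168.159.185:55498, "
        "192.168.190.37:36264, 192.168.190.37:42684, 192.168.190.37:52468"
    ),
})


def unsupported_client_ips(log_line: str) -> list[str]:
    """Return the distinct client IPs named in the message (hypothetical helper)."""
    message = json.loads(log_line)["message"]
    # Each client is reported as ip:port; keep only the IP part.
    ips = re.findall(r"(\d+\.\d+\.\d+\.\d+):\d+", message)
    return sorted(set(ips))


if __name__ == "__main__":
    for ip in unsupported_client_ips(LOG_LINE):
        print(ip)
```

The resulting IPs can then be compared with the IP column of `kubectl get pods -n skunkworks -o wide`.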
What did you expect to happen?
The operator carries out the upgrade by pod replacement and does not treat connections from coordinators as blocking.
How can we reproduce it (as minimally and precisely as possible)?
Create a cluster:
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: mycluster
  namespace: skunkworks
spec:
  version: 7.1.17
  routing:
    publicIPSource: service
  coordinatorSelection:
    - priority: 100
      processClass: coordinator
  databaseConfiguration:
    logs: 3
    storage: 3
    redundancy_mode: single
    storage_engine: ssd-2
  processCounts:
    coordinator: 3
    stateless: 3
    storage: 3
    log: 3
Start using it with some other client so that there is some activity on it, then change its version to 7.3.27 using kubectl edit.
Anything else we need to know?
No response
FDB Kubernetes operator
Kubernetes version
$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.26.14-eks-b9c9ed7
Cloud provider
Could you please verify whether there are any other processes/tools running in your setup? I was not able to reproduce the issue. Is it possible that you ran fdbcli from those Pods?
See: https://github.com/FoundationDB/fdb-kubernetes-operator/pull/1995
> Could you please verify whether there are any other processes/tools running in your setup? I was not able to reproduce the issue. Is it possible that you ran fdbcli from those Pods?
There are other processes connected to the cluster, but they have client libraries for both versions and their IP is not reported, only some of the coordinator IPs are reported as being incompatible.
I do occasionally run fdbcli from coordinator/storage pods to check cluster status, but the fdbcli process exits as soon as the status is returned, AFAIK.
> See: https://github.com/FoundationDB/fdb-kubernetes-operator/pull/1995
Thanks for adding this; I will try to iterate on the testcase, perhaps it has to do with the presence of other clients using the coordinators?
If you run fdbcli with version 7.1+, then fdbcli will appear as a client connected to the cluster. It will take some time until FDB cleans up the "old" clients.
I see, thanks for the explanation; isn't the operator also using fdbcli as a fallback way to read status? I can check in the logs whether it's actually doing that, for some reason.
Is this time to reap old clients configurable or is it based on the lifecycle of half-open TCP connections?
> I see, thanks for the explanation; isn't the operator also using fdbcli as a fallback way to read status? I can check in the logs whether it's actually doing that, for some reason.
It does, but only if the call with the bindings didn't work. The fdbcli call from the operator is made directly from the operator Pod with the log group fdb-kubernetes-operator, and that log group is filtered out in the compatibility check: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/api/v1beta2/foundationdbcluster_types.go#L1168-L1171
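The log-group filtering described above can be sketched roughly as follows. This is an illustration in Python, not the operator's Go code: the function name, the `IGNORED_LOG_GROUPS` set, and the shape of the client records (an `address` and a `log_group` per connected client) are assumptions made for the example.

```python
# Log groups whose connections are ignored by the check. The log group
# name comes from the comment above; the variable itself is hypothetical.
IGNORED_LOG_GROUPS = {"fdb-kubernetes-operator"}


def blocking_clients(connected_clients, ignored=IGNORED_LOG_GROUPS):
    """Return the addresses of unsupported clients that still block the upgrade.

    `connected_clients` is assumed to be a list of dicts with the keys
    "address" and "log_group".
    """
    return [
        client["address"]
        for client in connected_clients
        if client.get("log_group") not in ignored
    ]


if __name__ == "__main__":
    clients = [
        # Made-up sample data: one operator connection, one fdbcli session.
        {"address": "10.0.0.5:40000", "log_group": "fdb-kubernetes-operator"},
        {"address": "192.168.149.171:49018", "log_group": "default"},
    ]
    print(blocking_clients(clients))
```

With this kind of filter, an fdbcli call made by the operator itself never blocks the upgrade, while an fdbcli session started by hand from a pod with a different log group still does.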
> Is this time to reap old clients configurable or is it based on the lifecycle of half-open TCP connections?
It's configurable: https://github.com/apple/foundationdb/blob/main/fdbserver/include/fdbserver/ClusterController.actor.h#L247-L262. The knob CC_PRUNE_CLIENTS_INTERVAL can be set on the CC (cluster controller) and defaults to 60.0 seconds. Make sure to test changes to those knobs properly, as there might be side effects.
Another option would be to filter out reported incompatible connections from IPs/machines that host an fdbserver process. We would need to document this change, but we generally do not recommend running any applications in the same Pod as the fdbserver processes, so we should be fine doing that.
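The idea of ignoring connections that originate from fdbserver hosts could look something like the sketch below. It only illustrates the proposal in Python under stated assumptions and is not the change that was eventually merged; the function name and the sample external client address are made up.

```python
def drop_server_hosted_clients(client_addresses, server_ips):
    """Remove clients whose IP also hosts an fdbserver process (hypothetical helper).

    `client_addresses` are "ip:port" strings as reported in the operator log;
    `server_ips` is the set of IPs of the cluster's own processes.
    """
    remaining = []
    for address in client_addresses:
        # Split off the port; rpartition keeps everything before the last ":".
        ip, _, _port = address.rpartition(":")
        if ip not in server_ips:
            remaining.append(address)
    return remaining


if __name__ == "__main__":
    reported = [
        "192.168.149.171:49018",
        "192.168.159.185:52772",
        "10.0.0.5:40000",  # made-up external client for illustration
    ]
    coordinator_ips = {"192.168.149.171", "192.168.159.185", "192.168.190.37"}
    print(drop_server_hosted_clients(reported, coordinator_ips))
```

The trade-off is that a genuinely incompatible application running on the same IP as an fdbserver process would be ignored as well, which is why the change would need to be documented.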
I guess the TL;DR for the issue/question above is that if you stop using fdbcli for a while, the operator should move forward with the upgrade (once the clients are removed from the list).
Thanks for the fix! I noticed just now that it was merged. I could not reproduce the issue, and generally I only run fdbcli on the pods to check status, so it should not linger behind, but the mitigation looks solid!