
Deferring reconciliation due to unsupported clients (coordinators)

Open gm42 opened this issue 10 months ago • 7 comments

What happened?

I run a cluster with separate coordinator pods on an older FoundationDB version, e.g. 7.1.17. Once I set the version to the latest, 7.3.27, the operator never performs the upgrade because of this error:

{"level":"info","ts":"2024-04-16T06:22:20Z","logger":"controller","msg":"Reconciliation terminated early","namespace":"skunkworks","cluster":"mycluster","reconciler":"controllers.checkClientCompatibility","requeueAfter":60,"message":"7 clients do not support version 7.3.27: 192.168.149.171:49018, 192.168.149.171:52564, 192.168.159.185:52772, 192.168.159.185:55498, 192.168.190.37:36264, 192.168.190.37:42684, 192.168.190.37:52468"}

However, those IPs correspond to the pod IPs of 3 of the coordinators (!), not to an external client (from the cluster's point of view).
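To make the comparison with pod IPs easier, the reported addresses can be grouped by IP. The following is only a diagnostic sketch: it assumes the log line is valid JSON and that the message keeps the "N clients do not support version X: addr, addr, ..." shape shown above. The function name is made up for illustration, and the log line is abbreviated to the fields used here.

```python
import json
from collections import Counter

# Operator log line from above, abbreviated to the relevant fields.
LOG_LINE = (
    '{"reconciler":"controllers.checkClientCompatibility","requeueAfter":60,'
    '"message":"7 clients do not support version 7.3.27: '
    '192.168.149.171:49018, 192.168.149.171:52564, 192.168.159.185:52772, '
    '192.168.159.185:55498, 192.168.190.37:36264, 192.168.190.37:42684, '
    '192.168.190.37:52468"}'
)


def connections_per_ip(log_line: str) -> Counter:
    """Count reported client connections per source IP (hypothetical helper)."""
    message = json.loads(log_line)["message"]
    # Everything after the first ": " is the comma-separated address list.
    _, _, address_list = message.partition(": ")
    # Drop the ephemeral port; only the source IP identifies the pod.
    ips = [addr.rsplit(":", 1)[0] for addr in address_list.split(", ") if addr]
    return Counter(ips)


counts = connections_per_ip(LOG_LINE)
for ip, n in sorted(counts.items()):
    print(f"{ip}: {n} connection(s)")
```

The resulting IPs can then be checked against the pod IPs listed by `kubectl get pods -n skunkworks -o wide`.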

What did you expect to happen?

The operator should carry out the upgrade by pod replacement, without treating connections from coordinators as blocking.

How can we reproduce it (as minimally and precisely as possible)?

Create a cluster:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: mycluster
  namespace: skunkworks
spec:
  version: 7.1.17
  routing:
    publicIPSource: service
  coordinatorSelection:
  - priority: 100
    processClass: coordinator
  databaseConfiguration:
    logs: 3
    storage: 3
    redundancy_mode: single
    storage_engine: ssd-2
  processCounts:
    coordinator: 3
    stateless: 3
    storage: 3
    log: 3

Start using it from some other client so that there is some activity on it; then change its version to 7.3.27 using kubectl edit.

Anything else we need to know?

No response

FDB Kubernetes operator

v1.33

Kubernetes version

$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.26.14-eks-b9c9ed7

Cloud provider

AWS

gm42 avatar Apr 16 '24 06:04 gm42

Could you please verify if there are any other processes/tools running in your setup? I was not able to reproduce the issue. Is it possible that you ran fdbcli from those Pods?

johscheuer avatar Apr 17 '24 09:04 johscheuer

See: https://github.com/FoundationDB/fdb-kubernetes-operator/pull/1995

johscheuer avatar Apr 17 '24 09:04 johscheuer

Could you please verify if there are any other processes/tools running in your setup? I was not able to reproduce the issue. Is it possible that you ran fdbcli from those Pods?

There are other processes connected to the cluster, but they have client libraries for both versions and their IPs are not reported; only some of the coordinator IPs are reported as incompatible. I do occasionally run fdbcli from coordinator/storage pods to check cluster status, but the fdbcli process exits as soon as the status is returned, AFAIK.

See: https://github.com/FoundationDB/fdb-kubernetes-operator/pull/1995

Thanks for adding this; I will try to iterate on the test case. Perhaps it has to do with the presence of other clients using the coordinators?

gm42 avatar Apr 19 '24 06:04 gm42

If you run fdbcli with 7.1+, fdbcli will appear as a client connected to the cluster. It will take some time until FDB cleans up the "old" clients.

johscheuer avatar Apr 19 '24 11:04 johscheuer

I see, thanks for the explanation; isn't the operator also using fdbcli as a fallback to read status? I can check in the logs whether it's actually doing that for some reason.

Is this time to reap old clients configurable or is it based on the lifecycle of half-open TCP connections?

gm42 avatar Apr 19 '24 11:04 gm42

I see, thanks for the explanation; isn't the operator also using fdbcli as a fallback to read status? I can check in the logs whether it's actually doing that for some reason.

It does, but only if the call with the bindings didn't work. The fdbcli call from the operator will be from the operator Pod directly with the log group fdb-kubernetes-operator and that log group will be filtered out in the compatibility check: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/api/v1beta2/foundationdbcluster_types.go#L1168-L1171
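The log-group filtering described here can be pictured roughly as follows. This is an illustrative sketch, not the operator's actual Go code from the linked file: the record shape (`address`, `log_group`), the function name, and the sample clients are assumptions made for the example. Only the log group name `fdb-kubernetes-operator` comes from the comment above.

```python
# Log group used by the operator's own connections (named in the comment above).
OPERATOR_LOG_GROUP = "fdb-kubernetes-operator"


def clients_to_check(clients, ignored_log_groups=frozenset({OPERATOR_LOG_GROUP})):
    """Return addresses of clients whose log group is not ignored (hypothetical helper)."""
    return [
        client["address"]
        for client in clients
        if client.get("log_group") not in ignored_log_groups
    ]


# Hypothetical client records, invented for this example.
sample_clients = [
    {"address": "192.168.190.37:36264", "log_group": "default"},
    {"address": "10.0.0.12:41000", "log_group": "fdb-kubernetes-operator"},
    {"address": "10.0.0.40:51234", "log_group": "my-app"},
]

remaining = clients_to_check(sample_clients)
print(remaining)
```

In this sketch, an fdbcli session started by hand from a coordinator pod would not carry the operator's log group, so it stays in the list, which matches the behaviour reported in this issue.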

Is this time to reap old clients configurable or is it based on the lifecycle of half-open TCP connections?

It's configurable: https://github.com/apple/foundationdb/blob/main/fdbserver/include/fdbserver/ClusterController.actor.h#L247-L262. The knob CC_PRUNE_CLIENTS_INTERVAL can be set on the CC (cluster controller) and defaults to 60.0 (seconds). Make sure to test changes to those knobs properly, as there might be side effects.

johscheuer avatar Apr 19 '24 12:04 johscheuer

Another option would be to filter out reported incompatible connections from IPs/machines that host an fdbserver process. We would need to document this change, but in general we do not recommend running any applications in the same Pod as the fdbserver processes, so we should be fine doing that.

I guess the TL;DR for the issue/question above is: if you stop using fdbcli for a while, the operator should move forward with the upgrade (once the clients are removed from the list).
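The option of ignoring connections that originate from IPs hosting an fdbserver process could look roughly like this. It is a sketch of the idea only, not the merged fix: the function name and the external address `10.0.0.40:51234` are invented for illustration, and the coordinator IPs are the ones from the log message in the issue description.

```python
def blocking_clients(unsupported_addresses, fdbserver_ips):
    """Keep only unsupported clients that do not come from an fdbserver host (hypothetical helper)."""
    return [
        address
        for address in unsupported_addresses
        # Compare on the IP only; the port of a client connection is ephemeral.
        if address.rsplit(":", 1)[0] not in fdbserver_ips
    ]


# IPs of the coordinator pods reported in the issue.
fdbserver_ips = {"192.168.149.171", "192.168.159.185", "192.168.190.37"}

reported = [
    "192.168.149.171:49018",
    "192.168.190.37:52468",
    "10.0.0.40:51234",  # hypothetical external client, not from the issue
]

still_blocking = blocking_clients(reported, fdbserver_ips)
print(still_blocking)
```

With such a filter, lingering fdbcli sessions from the coordinator pods would no longer defer the upgrade, while a genuinely incompatible external client would still be reported.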

johscheuer avatar Apr 19 '24 14:04 johscheuer

Thanks for the fix! I just noticed that it was merged. I could not reproduce the issue, and generally I only run fdbcli on the pods to check status, so it should not linger, but the mitigation looks solid!

gm42 avatar Jun 18 '24 11:06 gm42