
After a service is recreated with a new IP, fdb.cluster still uses the old service IP

thorn126 opened this issue 3 years ago

What happened?

If using

    routing:
      publicIPSource: service
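For context, this setting lives under `spec.routing` in the FoundationDBCluster resource. A minimal sketch (the cluster name and version here are placeholders, and the API version may differ across operator releases):

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: test-cluster            # placeholder name
spec:
  version: 6.3.24               # example FDB version
  routing:
    publicIPSource: service     # advertise Service ClusterIPs instead of Pod IPs
```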

then after a Service is deleted, the operator recreates it, but with a new ClusterIP:

oc get svc
NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
mdm-foundationdb-ibm-log-1         ClusterIP   172.30.1.229     <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-log-2         ClusterIP   172.30.197.98    <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-log-3         ClusterIP   172.30.13.176    <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-log-4         ClusterIP   172.30.96.172    <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-proxy-1       ClusterIP   172.30.214.129   <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-proxy-2       ClusterIP   172.30.215.209   <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-stateless-1   ClusterIP   172.30.216.201   <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-storage-1     ClusterIP   172.30.230.42    <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-storage-2     ClusterIP   172.30.226.243   <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-storage-3     ClusterIP   172.30.218.154   <none>        4500/TCP,4501/TCP   4m33s
oc delete svc mdm-foundationdb-ibm-storage-1 mdm-foundationdb-ibm-storage-2 mdm-foundationdb-ibm-storage-3
service "mdm-foundationdb-ibm-storage-1" deleted
service "mdm-foundationdb-ibm-storage-2" deleted
service "mdm-foundationdb-ibm-storage-3" deleted

Check the services again:

 oc get svc
NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
mdm-foundationdb-ibm-log-1         ClusterIP   172.30.1.229     <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-log-2         ClusterIP   172.30.197.98    <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-log-3         ClusterIP   172.30.13.176    <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-log-4         ClusterIP   172.30.96.172    <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-proxy-1       ClusterIP   172.30.214.129   <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-proxy-2       ClusterIP   172.30.215.209   <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-stateless-1   ClusterIP   172.30.216.201   <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-storage-1     ClusterIP   172.30.104.200   <none>        4500/TCP,4501/TCP   4s
mdm-foundationdb-ibm-storage-2     ClusterIP   172.30.171.137   <none>        4500/TCP,4501/TCP   4s
mdm-foundationdb-ibm-storage-3     ClusterIP   172.30.240.39    <none>        4500/TCP,4501/TCP   4s

Check fdb.cluster; it still contains the old service IPs:

 cat /var/dynamic-conf/fdb.cluster
mdm_foundationdb_ibm:…@172.30.230.42:4500:tls,172.30.226.243:4500:tls,172.30.218.154:4500:tls

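For reference, the cluster file format is `description:ID@addr1,addr2,...` (the ID above is redacted). A small sketch that pulls the coordinator IPs out of such a string, so they can be compared against the current Service ClusterIPs; the `xxx` cluster ID below is a made-up placeholder:

```python
def coordinator_ips(cluster_file: str) -> list[str]:
    """Extract coordinator IP addresses from fdb.cluster contents.

    Format: description:ID@host1:port[:tls],host2:port[:tls],...
    """
    _, _, coords = cluster_file.strip().partition("@")
    return [addr.split(":")[0] for addr in coords.split(",")]

# The stale coordinators from this issue (cluster ID redacted to "xxx"):
stale = ("mdm_foundationdb_ibm:xxx@172.30.230.42:4500:tls,"
         "172.30.226.243:4500:tls,172.30.218.154:4500:tls")
print(coordinator_ips(stale))
# ['172.30.230.42', '172.30.226.243', '172.30.218.154']
```

None of these three IPs appear in the recreated Services above, which is why every coordinator is unreachable.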
Now connecting to the cluster fails:

oc exec -it mdm-foundationdb-ibm-storage-2 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "foundationdb" out of: foundationdb, foundationdb-kubernetes-sidecar, foundationdb-kubernetes-init (init)
bash-4.4$ fdbcli
Using cluster file `/var/dynamic-conf/fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status details

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  172.30.218.154:4500:tls  (unreachable)
  172.30.226.243:4500:tls  (unreachable)
  172.30.230.42:4500:tls  (unreachable)

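This matches FDB's coordination model: the cluster needs a strict majority of coordinators reachable, and here all three coordinator IPs are stale. A tiny illustration of the quorum rule:

```python
def has_quorum(total_coordinators: int, reachable: int) -> bool:
    """FDB coordination requires a strict majority of coordinators reachable."""
    return reachable > total_coordinators // 2

# With 3 coordinators, one stale service IP is tolerated...
print(has_quorum(3, 2))  # True
# ...but two or more stale service IPs lose quorum; the database is unavailable.
print(has_quorum(3, 1))  # False
```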
After the service IPs change, the operator should update fdb.cluster; otherwise FDB becomes unavailable, even though the cluster status still reports healthy, which is incorrect.

 oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep health
  health:
    healthy: true

What did you expect to happen?

fdb.cluster should be updated with the new service IPs.

How can we reproduce it (as minimally and precisely as possible)?

  1. In the CR, set:
    routing:
      publicIPSource: service
  2. Delete the Services (specifically, the coordinator Services).
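To see which Services currently back the coordinators (and therefore which ones to delete for the repro), one illustrative approach is to match the connection string against the Service ClusterIPs. The names, IPs, and cluster ID below are placeholders modeled on the outputs later in this issue:

```python
def coordinator_services(connection_string: str, services: dict[str, str]) -> list[str]:
    """Return names of Services whose ClusterIP appears as a coordinator.

    services maps Service name -> ClusterIP.
    Connection string format: description:ID@ip:port[,ip:port,...]
    """
    _, _, coords = connection_string.partition("@")
    coord_ips = {addr.split(":")[0] for addr in coords.split(",")}
    return sorted(name for name, ip in services.items() if ip in coord_ips)

# Placeholder data:
svcs = {
    "test-cluster-log-5": "172.30.229.194",
    "test-cluster-log-6": "172.30.245.142",
    "test-cluster-log-8": "172.30.90.51",
    "test-cluster-storage-4": "172.30.45.139",
}
conn = "test_cluster:xxx@172.30.90.51:4501,172.30.229.194:4501,172.30.245.142:4501"
print(coordinator_services(conn, svcs))
# ['test-cluster-log-5', 'test-cluster-log-6', 'test-cluster-log-8']
```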

Anything else we need to know?

No response

FDB Kubernetes operator

$ kubectl fdb version
0.48.0

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v0.21.0-beta.1", GitCommit:"88e7eba36ac7cc026d064bb6e4a988acc18a144e", GitTreeState:"clean", BuildDate:"2021-09-21T10:07:38Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5+3afdacb", GitCommit:"3c28e7a79b58e78b4c1dc1ab7e5f6c6c2d3aedd3", GitTreeState:"clean", BuildDate:"2022-05-10T16:30:48Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (0.21) and server (1.23) exceeds the supported minor version skew of +/-1

Cloud provider

OCP

thorn126 avatar Oct 17 '22 07:10 thorn126

Have you tried the same changes with a more recent operator version? Could you share the operator logs from the timespan in which you deleted the services?

Just to clarify: you only deleted the Service but left the Pod running (so only the service IP changed, not the Pod IP)? Is there an actual need for you to delete those services, or is the concern to prevent issues if someone/something accidentally deletes them?

johscheuer avatar Oct 17 '22 08:10 johscheuer

Thank you @johscheuer for your prompt response; I'll try the latest version and attach logs. Yes, when deleting the services the pods were still running; I didn't do anything else. Chances are a service gets deleted accidentally, or the service is unstable and somehow gets lost.

thorn126 avatar Oct 17 '22 08:10 thorn126

I tried the latest operator, v1.9.0, and still hit the same issue. applelog.zip Let me know if you need more info. @johscheuer

thorn126 avatar Oct 17 '22 09:10 thorn126

Just for my understanding: was only a single service recreated, or all at once? If a single service is deleted, the operator should be able to select a new coordinator or update the IP address. If multiple or all services are deleted, this gets a bit more complicated, and the operator doesn't support automatic recovery from that state.

johscheuer avatar Oct 17 '22 09:10 johscheuer

Yes, I deleted the 3 storage services, which are all coordinator services. May I ask: if a single deleted service can be recovered from, is there any reason why multiple services couldn't be recovered similarly? Thanks.

thorn126 avatar Oct 17 '22 11:10 thorn126

Also, the controller log complains "Service \"test-cluster-cluster-controller-1\" not found", even though the storage services were in fact recreated; this message is confusing:

{"level":"error","ts":1665997062.672065,"logger":"controller","msg":"Error in reconciliation","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.addPods","requeueAfter":0,"error":"Service \"test-cluster-cluster-controller-1\" not found","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

thorn126 avatar Oct 17 '22 12:10 thorn126

 oc get svc
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-cluster-cluster-controller-2   ClusterIP   172.30.152.169   <none>        4500/TCP,4501/TCP   3h29m
test-cluster-log-5                  ClusterIP   172.30.13.115    <none>        4500/TCP,4501/TCP   3h29m
test-cluster-log-6                  ClusterIP   172.30.104.156   <none>        4500/TCP,4501/TCP   3h29m
test-cluster-log-7                  ClusterIP   172.30.174.62    <none>        4500/TCP,4501/TCP   3h29m
test-cluster-log-8                  ClusterIP   172.30.56.238    <none>        4500/TCP,4501/TCP   3h29m
test-cluster-storage-4              ClusterIP   172.30.151.4     <none>        4500/TCP,4501/TCP   3h27m
test-cluster-storage-5              ClusterIP   172.30.105.22    <none>        4500/TCP,4501/TCP   3h27m
test-cluster-storage-6              ClusterIP   172.30.189.231   <none>        4500/TCP,4501/TCP   3h27m

oc get po
NAME                                                         READY   STATUS    RESTARTS   AGE
fdb-kubernetes-operator-controller-manager-d5d6ccd5c-kjhwn   1/1     Running   0          3h30m
test-cluster-cluster-controller-2                            2/2     Running   0          3h30m
test-cluster-log-5                                           2/2     Running   0          3h30m
test-cluster-log-6                                           2/2     Running   0          3h30m
test-cluster-log-7                                           2/2     Running   0          3h30m
test-cluster-log-8                                           2/2     Running   0          3h30m
test-cluster-storage-4                                       2/2     Running   0          3h30m
test-cluster-storage-5                                       2/2     Running   0          3h30m
test-cluster-storage-6                                       2/2     Running   0          3h30m

test-cluster-cluster-controller-1 doesn't exist, though; it is test-cluster-cluster-controller-2 now, but the operator is still trying to get the old Service. (BTW, I didn't delete the cluster controller's Service; that was the operator itself.)

thorn126 avatar Oct 17 '22 12:10 thorn126

Actually, I did more testing: even when only one service is deleted, it doesn't recover.

oc get svc
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-cluster-cluster-controller-2   ClusterIP   172.30.182.87    <none>        4500/TCP,4501/TCP   8m5s
test-cluster-log-5                  ClusterIP   172.30.229.194   <none>        4500/TCP,4501/TCP   8m5s
test-cluster-log-6                  ClusterIP   172.30.245.142   <none>        4500/TCP,4501/TCP   8m5s
test-cluster-log-7                  ClusterIP   172.30.255.83    <none>        4500/TCP,4501/TCP   8m5s
test-cluster-log-8                  ClusterIP   172.30.90.51     <none>        4500/TCP,4501/TCP   8m5s
test-cluster-storage-4              ClusterIP   172.30.45.139    <none>        4500/TCP,4501/TCP   8m5s
test-cluster-storage-5              ClusterIP   172.30.181.57    <none>        4500/TCP,4501/TCP   8m5s
test-cluster-storage-6              ClusterIP   172.30.183.175   <none>        4500/TCP,4501/TCP   8m5s

oc get foundationdbcluster test-cluster -o yaml |grep conn
  connectionString: test_cluster:…@172.30.90.51:4501,172.30.229.194:4501,172.30.245.142:4501

delete one coordinator service

oc  delete svc test-cluster-log-8
service "test-cluster-log-8" deleted
# oc get svc
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-cluster-cluster-controller-2   ClusterIP   172.30.182.87    <none>        4500/TCP,4501/TCP   9m40s
test-cluster-log-5                  ClusterIP   172.30.229.194   <none>        4500/TCP,4501/TCP   9m40s
test-cluster-log-6                  ClusterIP   172.30.245.142   <none>        4500/TCP,4501/TCP   9m40s
test-cluster-log-7                  ClusterIP   172.30.255.83    <none>        4500/TCP,4501/TCP   9m40s
test-cluster-log-8                  ClusterIP   172.30.73.139    <none>        4500/TCP,4501/TCP   13s
test-cluster-storage-4              ClusterIP   172.30.45.139    <none>        4500/TCP,4501/TCP   9m40s
test-cluster-storage-5              ClusterIP   172.30.181.57    <none>        4500/TCP,4501/TCP   9m40s
test-cluster-storage-6              ClusterIP   172.30.183.175   <none>        4500/TCP,4501/TCP   9m40s

Check the connection string again:

  connectionString: test_cluster:…@172.30.90.51:4501,172.30.229.194:4501,172.30.245.142:4501

It still contains the old service IP, and nothing in the controller log indicates any change:

{"level":"info","ts":1666010517.5915344,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"testoperator2","cluster":"test-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1666010517.6078048,"logger":"controller","msg":"Reconciliation complete","namespace":"testoperator2","cluster":"test-cluster","generation":1}

If I then delete the remaining coordinator services, so that all 3 coordinator IPs are stale:

oc delete svc test-cluster-log-5 test-cluster-log-6
service "test-cluster-log-5" deleted
service "test-cluster-log-6" deleted
# oc get svc
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-cluster-cluster-controller-2   ClusterIP   172.30.182.87    <none>        4500/TCP,4501/TCP   16m
test-cluster-log-5                  ClusterIP   172.30.84.107    <none>        4500/TCP,4501/TCP   9s
test-cluster-log-7                  ClusterIP   172.30.255.83    <none>        4500/TCP,4501/TCP   16m
test-cluster-log-8                  ClusterIP   172.30.73.139    <none>        4500/TCP,4501/TCP   7m
test-cluster-storage-4              ClusterIP   172.30.45.139    <none>        4500/TCP,4501/TCP   16m
test-cluster-storage-5              ClusterIP   172.30.181.57    <none>        4500/TCP,4501/TCP   16m
test-cluster-storage-6              ClusterIP   172.30.183.175   <none>        4500/TCP,4501/TCP   16m

Checking the controller log, this time there is a timeout, which indicates FDB is unavailable:

{"level":"info","ts":1666010925.5818737,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.excludeProcesses"}
{"level":"info","ts":1666010925.5821764,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.changeCoordinators"}
{"level":"info","ts":1666010925.582316,"logger":"fdbclient","msg":"Running command","namespace":"testoperator2","cluster":"test-cluster","path":"/usr/bin/fdb/6.3/fdbcli","args":["/usr/bin/fdb/6.3/fdbcli","--exec","option on ACCESS_SYSTEM_KEYS; get \ufffd/coordinators","-C","/tmp/3766994553","--log","--log","--trace_format","xml","--log-dir","/var/log/fdb","--timeout","10"]}
{"level":"error","ts":1666010935.6083379,"logger":"fdbclient","msg":"Error from FDB command","namespace":"testoperator2","cluster":"test-cluster","code":1,"stdout":"\nWARNING: Long delay (Ctrl-C to interrupt)\nSpecified timeout reached -- exiting...\n>>> option on ACCESS_SYSTEM_KEYS\nOption enabled for all transactions\n>>> get \\xff/coordinators\n\nThe database is unavailable; type `status' for more information.\n\n","stderr":"","error":"exit status 1","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).runCommandWithBackoff\n\t/workspace/fdbclient/admin_client.go:261\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).GetConnectionString\n\t/workspace/fdbclient/admin_client.go:632\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.changeCoordinators.reconcile\n\t/workspace/controllers/change_coordinators.go:50\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

thorn126 avatar Oct 17 '22 12:10 thorn126

@johscheuer From the testing: if we delete more than half of the coordinator service IPs (for example, two of three), then running `status details` in fdbcli inside a pod shows the database is unavailable. The apple controller log also indicates the database is unavailable, e.g. "error":"unable to fetch connection string: The database is unavailable; type `status' for more information.\n". But checking the cluster resource:

oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep health
  health:
    healthy: true

oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep avai
    available: true

Why is the foundationdbcluster status not correct, and when does it get updated to unavailable?

thorn126 avatar Oct 18 '22 14:10 thorn126

The health status does not change away from true until I restart the apple controller.

thorn126 avatar Oct 18 '22 14:10 thorn126

> Yes, I deleted the 3 storage services, which are all coordinator services. May I ask: if a single deleted service can be recovered from, is there any reason why multiple services couldn't be recovered similarly? Thanks.

Deleting a single service should be fine, and the operator should recover from that state; if that is not the case, there might be a bug in the code. To be precise: deleting all services for the coordinator Pods (in this case the storage Pods) means the operator is no longer able to connect to the database, since the connection string contains the old service IPs. In theory the operator could recover from this case, but it's definitely not well tested.

> Why is the foundationdbcluster status not correct, and when does it get updated to unavailable?

That's probably simply a bug :)

Feel free to look into those issues and create PRs with the respective fixes. We don't use the service implementation ourselves, and the plan going forward is to use DNS instead of service IPs.

johscheuer avatar Oct 19 '22 07:10 johscheuer

Thank you @johscheuer for the explanation. From my tests: if we delete one coordinator service, that one coordinator IP becomes unreachable, but FDB is still available, so the impact is small; after deleting the corresponding pod, all coordinators are reachable again. If we delete more than half of the coordinator service IPs, then restart the FDB cluster pods and run the fix-coordinator-ips command, it actually recovers. As for the incorrect foundationdbcluster status: since you mentioned you don't use the service implementation, we'll check the apple controller log to determine whether FDB is unavailable as a workaround. Closing this issue since I managed to make it work. Thanks for all your help. @johscheuer

thorn126 avatar Oct 20 '22 13:10 thorn126

I believe it's still a bug that the coordinators are not updated when you delete a single service, so feel free to reopen this and provide a fix for it.

johscheuer avatar Oct 21 '22 13:10 johscheuer

@johscheuer If we want to fix it, it should be fixed in the latest version, right? But you mentioned the plan going forward is to use DNS instead of service IPs, and this issue was found when using service IPs. Let me know if you still think it's worth fixing even though it's not the strategic direction, and I'll reopen it. Thanks.

thorn126 avatar Oct 24 '22 00:10 thorn126