fdb-kubernetes-operator
After a service IP is recreated, fdb.cluster still uses the old service IP
What happened?
When using

routing:
  publicIPSource: service

a deleted service is recreated with a new cluster IP:
oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mdm-foundationdb-ibm-log-1 ClusterIP 172.30.1.229 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-log-2 ClusterIP 172.30.197.98 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-log-3 ClusterIP 172.30.13.176 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-log-4 ClusterIP 172.30.96.172 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-proxy-1 ClusterIP 172.30.214.129 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-proxy-2 ClusterIP 172.30.215.209 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-stateless-1 ClusterIP 172.30.216.201 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-storage-1 ClusterIP 172.30.230.42 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-storage-2 ClusterIP 172.30.226.243 <none> 4500/TCP,4501/TCP 4m33s
mdm-foundationdb-ibm-storage-3 ClusterIP 172.30.218.154 <none> 4500/TCP,4501/TCP 4m33s
oc delete svc mdm-foundationdb-ibm-storage-1 mdm-foundationdb-ibm-storage-2 mdm-foundationdb-ibm-storage-3
service "mdm-foundationdb-ibm-storage-1" deleted
service "mdm-foundationdb-ibm-storage-2" deleted
service "mdm-foundationdb-ibm-storage-3" deleted
Check the services again:
oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mdm-foundationdb-ibm-log-1 ClusterIP 172.30.1.229 <none> 4500/TCP,4501/TCP 5m15s
mdm-foundationdb-ibm-log-2 ClusterIP 172.30.197.98 <none> 4500/TCP,4501/TCP 5m15s
mdm-foundationdb-ibm-log-3 ClusterIP 172.30.13.176 <none> 4500/TCP,4501/TCP 5m15s
mdm-foundationdb-ibm-log-4 ClusterIP 172.30.96.172 <none> 4500/TCP,4501/TCP 5m15s
mdm-foundationdb-ibm-proxy-1 ClusterIP 172.30.214.129 <none> 4500/TCP,4501/TCP 5m15s
mdm-foundationdb-ibm-proxy-2 ClusterIP 172.30.215.209 <none> 4500/TCP,4501/TCP 5m15s
mdm-foundationdb-ibm-stateless-1 ClusterIP 172.30.216.201 <none> 4500/TCP,4501/TCP 5m15s
mdm-foundationdb-ibm-storage-1 ClusterIP 172.30.104.200 <none> 4500/TCP,4501/TCP 4s
mdm-foundationdb-ibm-storage-2 ClusterIP 172.30.171.137 <none> 4500/TCP,4501/TCP 4s
mdm-foundationdb-ibm-storage-3 ClusterIP 172.30.240.39 <none> 4500/TCP,4501/TCP 4s
Check fdb.cluster; it still contains the old service IPs:
cat /var/dynamic-conf/fdb.cluster
mdm_foundationdb_ibm:[email protected]:4500:tls,172.30.226.243:4500:tls,172.30.218.154:4500:tls
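For context, the cluster file is a single line of the form description:ID@coordinator,coordinator,..., and a quorum of the listed coordinator addresses must stay reachable; the addresses in the file above are exactly the old, now-stale service IPs. A small sketch of pulling the coordinator list apart (the ID below is made up for illustration, since it is redacted above):

```shell
# Hypothetical sketch, not operator code: split a cluster file's
# connection string into its parts. Format: description:ID@coord1,coord2,...
conn='mdm_foundationdb_ibm:abc123@172.30.230.42:4500:tls,172.30.226.243:4500:tls,172.30.218.154:4500:tls'

# Everything after '@' is the comma-separated coordinator list;
# a quorum of these must remain reachable for the database to be available.
coordinators=${conn#*@}
echo "$coordinators" | tr ',' '\n'
```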
Connecting to the FDB cluster now fails:
oc exec -it mdm-foundationdb-ibm-storage-2 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "foundationdb" out of: foundationdb, foundationdb-kubernetes-sidecar, foundationdb-kubernetes-init (init)
bash-4.4$ fdbcli
Using cluster file `/var/dynamic-conf/fdb.cluster'.
The database is unavailable; type `status' for more information.
Welcome to the fdbcli. For help, type `help'.
fdb> status details
Using cluster file `/var/dynamic-conf/fdb.cluster'.
Could not communicate with a quorum of coordination servers:
172.30.218.154:4500:tls (unreachable)
172.30.226.243:4500:tls (unreachable)
172.30.230.42:4500:tls (unreachable)
After the service IPs change, the operator should update fdb.cluster; otherwise FDB is unavailable even though the status still reports healthy, which is not correct:
oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep health
health:
healthy: true
What did you expect to happen?
fdb.cluster should be updated with the new service IPs.
How can we reproduce it (as minimally and precisely as possible)?
- In the CR, set:
  routing:
    publicIPSource: service
- Delete the coordinator services.
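For reference, a minimal CR for this setup might look like the following (the cluster name and FDB version here are illustrative assumptions, not taken from the report above):

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: test-cluster        # hypothetical name
spec:
  version: 6.3.24           # assumed FDB version
  routing:
    publicIPSource: service # processes advertise the service's cluster IP
```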
Anything else we need to know?
No response
FDB Kubernetes operator
$ kubectl fdb version
# paste output here
Kubernetes version
$ kubectl version
# paste output here
kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v0.21.0-beta.1", GitCommit:"88e7eba36ac7cc026d064bb6e4a988acc18a144e", GitTreeState:"clean", BuildDate:"2021-09-21T10:07:38Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5+3afdacb", GitCommit:"3c28e7a79b58e78b4c1dc1ab7e5f6c6c2d3aedd3", GitTreeState:"clean", BuildDate:"2022-05-10T16:30:48Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (0.21) and server (1.23) exceeds the supported minor version skew of +/-1
Cloud provider
Have you tried the same changes with a more recent operator version? Could you share the operator logs from the timespan in which you deleted the services?
Just to clarify: you only deleted the service but left the Pod running (so only the service IP changed, not the Pod IP)? Is there an actual need for you to delete those services, or is the concern to prevent issues if someone/something accidentally deletes services?
Thank you @johscheuer for your prompt response; I'll try the latest version and attach logs. Yes, when I deleted the services the Pods were still running, and I didn't do anything else. Chances are a service either gets deleted accidentally, or the service is unstable and is somehow lost.
I tried the latest operator, v1.9.0, and still hit the same issue: applelog.zip. Let me know if you need more info. @johscheuer
Just for my understanding, was only a single service recreated, or all of them at once? If a single service is deleted, the operator should be able to select a new coordinator or update the IP address. If multiple or all services are deleted, this gets a bit more complicated, and the operator doesn't support automatic recovery from that state.
Yes, I deleted the 3 storage services, which are all coordinator services. If a single deleted service can be recovered, is there any reason why multiple services can't be handled similarly? Thanks.
Also, the controller log complains "Service \"test-cluster-cluster-controller-1\" not found", but in fact the storage services were recreated. This message sounds confusing.
{"level":"error","ts":1665997062.672065,"logger":"controller","msg":"Error in reconciliation","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.addPods","requeueAfter":0,"error":"Service \"test-cluster-cluster-controller-1\" not found","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
test-cluster-cluster-controller-2 ClusterIP 172.30.152.169 <none> 4500/TCP,4501/TCP 3h29m
test-cluster-log-5 ClusterIP 172.30.13.115 <none> 4500/TCP,4501/TCP 3h29m
test-cluster-log-6 ClusterIP 172.30.104.156 <none> 4500/TCP,4501/TCP 3h29m
test-cluster-log-7 ClusterIP 172.30.174.62 <none> 4500/TCP,4501/TCP 3h29m
test-cluster-log-8 ClusterIP 172.30.56.238 <none> 4500/TCP,4501/TCP 3h29m
test-cluster-storage-4 ClusterIP 172.30.151.4 <none> 4500/TCP,4501/TCP 3h27m
test-cluster-storage-5 ClusterIP 172.30.105.22 <none> 4500/TCP,4501/TCP 3h27m
test-cluster-storage-6 ClusterIP 172.30.189.231 <none> 4500/TCP,4501/TCP 3h27m
oc get po
NAME READY STATUS RESTARTS AGE
fdb-kubernetes-operator-controller-manager-d5d6ccd5c-kjhwn 1/1 Running 0 3h30m
test-cluster-cluster-controller-2 2/2 Running 0 3h30m
test-cluster-log-5 2/2 Running 0 3h30m
test-cluster-log-6 2/2 Running 0 3h30m
test-cluster-log-7 2/2 Running 0 3h30m
test-cluster-log-8 2/2 Running 0 3h30m
test-cluster-storage-4 2/2 Running 0 3h30m
test-cluster-storage-5 2/2 Running 0 3h30m
test-cluster-storage-6 2/2 Running 0 3h30m
test-cluster-cluster-controller-1 no longer exists; it is test-cluster-cluster-controller-2 now, yet the operator is still trying to get the old service. (BTW, I didn't delete the cluster controller's service; that was done by the operator itself.)
Actually, I did more testing: even when only one service is deleted, the cluster doesn't recover.
oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
test-cluster-cluster-controller-2 ClusterIP 172.30.182.87 <none> 4500/TCP,4501/TCP 8m5s
test-cluster-log-5 ClusterIP 172.30.229.194 <none> 4500/TCP,4501/TCP 8m5s
test-cluster-log-6 ClusterIP 172.30.245.142 <none> 4500/TCP,4501/TCP 8m5s
test-cluster-log-7 ClusterIP 172.30.255.83 <none> 4500/TCP,4501/TCP 8m5s
test-cluster-log-8 ClusterIP 172.30.90.51 <none> 4500/TCP,4501/TCP 8m5s
test-cluster-storage-4 ClusterIP 172.30.45.139 <none> 4500/TCP,4501/TCP 8m5s
test-cluster-storage-5 ClusterIP 172.30.181.57 <none> 4500/TCP,4501/TCP 8m5s
test-cluster-storage-6 ClusterIP 172.30.183.175 <none> 4500/TCP,4501/TCP 8m5s
oc get foundationdbcluster test-cluster -o yaml |grep conn
connectionString: test_cluster:[email protected]:4501,172.30.229.194:4501,172.30.245.142:4501
Delete one coordinator service:
oc delete svc test-cluster-log-8
service "test-cluster-log-8" deleted
# oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
test-cluster-cluster-controller-2 ClusterIP 172.30.182.87 <none> 4500/TCP,4501/TCP 9m40s
test-cluster-log-5 ClusterIP 172.30.229.194 <none> 4500/TCP,4501/TCP 9m40s
test-cluster-log-6 ClusterIP 172.30.245.142 <none> 4500/TCP,4501/TCP 9m40s
test-cluster-log-7 ClusterIP 172.30.255.83 <none> 4500/TCP,4501/TCP 9m40s
test-cluster-log-8 ClusterIP 172.30.73.139 <none> 4500/TCP,4501/TCP 13s
test-cluster-storage-4 ClusterIP 172.30.45.139 <none> 4500/TCP,4501/TCP 9m40s
test-cluster-storage-5 ClusterIP 172.30.181.57 <none> 4500/TCP,4501/TCP 9m40s
test-cluster-storage-6 ClusterIP 172.30.183.175 <none> 4500/TCP,4501/TCP 9m40s
Check the connection string:
connectionString: test_cluster:[email protected]:4501,172.30.229.194:4501,172.30.245.142:4501
It still shows the old service IP, and nothing in the controller log indicates any change:
{"level":"info","ts":1666010517.5915344,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"testoperator2","cluster":"test-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1666010517.6078048,"logger":"controller","msg":"Reconciliation complete","namespace":"testoperator2","cluster":"test-cluster","generation":1}
If I then delete the remaining coordinator services, so that all 3 have been recreated:
oc delete svc test-cluster-log-5 test-cluster-log-6
service "test-cluster-log-5" deleted
service "test-cluster-log-6" deleted
# oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
test-cluster-cluster-controller-2 ClusterIP 172.30.182.87 <none> 4500/TCP,4501/TCP 16m
test-cluster-log-5 ClusterIP 172.30.84.107 <none> 4500/TCP,4501/TCP 9s
test-cluster-log-7 ClusterIP 172.30.255.83 <none> 4500/TCP,4501/TCP 16m
test-cluster-log-8 ClusterIP 172.30.73.139 <none> 4500/TCP,4501/TCP 7m
test-cluster-storage-4 ClusterIP 172.30.45.139 <none> 4500/TCP,4501/TCP 16m
test-cluster-storage-5 ClusterIP 172.30.181.57 <none> 4500/TCP,4501/TCP 16m
test-cluster-storage-6 ClusterIP 172.30.183.175 <none> 4500/TCP,4501/TCP 16m
Checking the controller log, this time there is a timeout, which indicates FDB is not available:
{"level":"info","ts":1666010925.5818737,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.excludeProcesses"}
{"level":"info","ts":1666010925.5821764,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.changeCoordinators"}
{"level":"info","ts":1666010925.582316,"logger":"fdbclient","msg":"Running command","namespace":"testoperator2","cluster":"test-cluster","path":"/usr/bin/fdb/6.3/fdbcli","args":["/usr/bin/fdb/6.3/fdbcli","--exec","option on ACCESS_SYSTEM_KEYS; get \ufffd/coordinators","-C","/tmp/3766994553","--log","--log","--trace_format","xml","--log-dir","/var/log/fdb","--timeout","10"]}
{"level":"error","ts":1666010935.6083379,"logger":"fdbclient","msg":"Error from FDB command","namespace":"testoperator2","cluster":"test-cluster","code":1,"stdout":"\nWARNING: Long delay (Ctrl-C to interrupt)\nSpecified timeout reached -- exiting...\n>>> option on ACCESS_SYSTEM_KEYS\nOption enabled for all transactions\n>>> get \\xff/coordinators\n\nThe database is unavailable; type `status' for more information.\n\n","stderr":"","error":"exit status 1","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).runCommandWithBackoff\n\t/workspace/fdbclient/admin_client.go:261\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).GetConnectionString\n\t/workspace/fdbclient/admin_client.go:632\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.changeCoordinators.reconcile\n\t/workspace/controllers/change_coordinators.go:50\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
@johscheuer From my testing: if we delete more than half of the coordinator service IPs (e.g., two of three), running fdbcli status details inside a pod shows that the database is unavailable. The apple controller log also indicates this, with something like "error":"unable to fetch connection string: The database is unavailable; type `status' for more information.\n". But checking the CR:
oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep health
health:
healthy: true
oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep avai
available: true
Why is the FoundationDBCluster status not correct? When will it be updated to unavailable?
Only after I restart the apple controller does healthy stop reporting true.
Yes, I deleted the 3 storage services, which are all coordinator services. If a single deleted service can be recovered, is there any reason why multiple services can't be handled similarly? Thanks.
Deleting a single service should be fine, and the operator should recover that state; if that is not the case, there might be a bug in the code. To be precise, deleting all services for the coordinator Pods (in this case the storage Pods) means the operator is no longer able to connect to the database, since the connection string will contain the old service IPs. In theory the operator might be able to recover from this case, but it's definitely not well tested.
Why is the FoundationDBCluster status not correct? When will it be updated to unavailable?
That's probably simply a bug :)
Feel free to look into those issues and create PRs with the respective fixes. We don't use the service implementation, and the plan going forward is to use DNS instead of service IPs.
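If I understand the DNS plan correctly, newer operator versions expose it via the routing section, roughly like this (a sketch; field names and availability should be checked against the operator version in use):

```yaml
spec:
  routing:
    # Put stable DNS names into the cluster file instead of service IPs,
    # so a recreated service no longer invalidates coordinator entries.
    useDNSInClusterFile: true
    headlessService: true  # may be required so per-process DNS entries exist
```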
Thank you @johscheuer for the explanation. From what I tried: if we delete one coordinator service IP, that coordinator becomes unreachable, but FDB is still available, so the impact is limited; after deleting the corresponding Pod, all coordinators are reachable again. If we delete more than half of the coordinator service IPs, restarting the FDB cluster Pods and then running the fix-coordinator-ips command actually recovers the cluster. As for the incorrect FoundationDBCluster status: since you mentioned you don't use the service implementation, we will check the apple controller log to determine whether FDB is unavailable as a workaround. Closing this issue since I managed to make it work. Thanks for all your help. @johscheuer
I believe it's still a bug that the coordinators are not updated when you delete a single service, so feel free to reopen the issue and provide a fix for it.
@johscheuer If we want to fix it, the fix should go into the latest version, right? But you mentioned that the plan going forward is to use DNS instead of service IPs, while this issue was found when using service IPs. Let me know if you still think it's worth fixing even though it's not the strategic direction, and I'll reopen it. Thanks.