
After a service is recreated with a new IP, fdb.cluster still uses the old service IP

thorn126 opened this issue 3 years ago

What happened?

If using

    routing:
      publicIPSource: service
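For context, this setting lives under `spec.routing` in the FoundationDBCluster resource. A minimal sketch (the cluster name and version here are placeholders, and the API version may differ across operator releases):

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: test-cluster            # placeholder name
spec:
  version: 6.3.24               # example FDB version
  routing:
    publicIPSource: service     # advertise Service ClusterIPs instead of Pod IPs
```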

then after a Service is deleted, the operator recreates it, but with a new ClusterIP:

oc get svc
NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
mdm-foundationdb-ibm-log-1         ClusterIP   172.30.1.229     <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-log-2         ClusterIP   172.30.197.98    <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-log-3         ClusterIP   172.30.13.176    <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-log-4         ClusterIP   172.30.96.172    <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-proxy-1       ClusterIP   172.30.214.129   <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-proxy-2       ClusterIP   172.30.215.209   <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-stateless-1   ClusterIP   172.30.216.201   <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-storage-1     ClusterIP   172.30.230.42    <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-storage-2     ClusterIP   172.30.226.243   <none>        4500/TCP,4501/TCP   4m33s
mdm-foundationdb-ibm-storage-3     ClusterIP   172.30.218.154   <none>        4500/TCP,4501/TCP   4m33s
oc delete svc mdm-foundationdb-ibm-storage-1 mdm-foundationdb-ibm-storage-2 mdm-foundationdb-ibm-storage-3
service "mdm-foundationdb-ibm-storage-1" deleted
service "mdm-foundationdb-ibm-storage-2" deleted
service "mdm-foundationdb-ibm-storage-3" deleted

Check the services again:

 oc get svc
NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
mdm-foundationdb-ibm-log-1         ClusterIP   172.30.1.229     <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-log-2         ClusterIP   172.30.197.98    <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-log-3         ClusterIP   172.30.13.176    <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-log-4         ClusterIP   172.30.96.172    <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-proxy-1       ClusterIP   172.30.214.129   <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-proxy-2       ClusterIP   172.30.215.209   <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-stateless-1   ClusterIP   172.30.216.201   <none>        4500/TCP,4501/TCP   5m15s
mdm-foundationdb-ibm-storage-1     ClusterIP   172.30.104.200   <none>        4500/TCP,4501/TCP   4s
mdm-foundationdb-ibm-storage-2     ClusterIP   172.30.171.137   <none>        4500/TCP,4501/TCP   4s
mdm-foundationdb-ibm-storage-3     ClusterIP   172.30.240.39    <none>        4500/TCP,4501/TCP   4s

Check fdb.cluster; it still contains the old service IPs:

 cat /var/dynamic-conf/fdb.cluster
mdm_foundationdb_ibm:…@172.30.230.42:4500:tls,172.30.226.243:4500:tls,172.30.218.154:4500:tls

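For reference, the cluster file format is `description:ID@addr1,addr2,...` (the ID above is redacted). A small sketch that pulls the coordinator IPs out of such a string, so they can be compared against the current Service ClusterIPs; the `xxx` cluster ID below is a made-up placeholder:

```python
def coordinator_ips(cluster_file: str) -> list[str]:
    """Extract coordinator IP addresses from fdb.cluster contents.

    Format: description:ID@host1:port[:tls],host2:port[:tls],...
    """
    _, _, coords = cluster_file.strip().partition("@")
    return [addr.split(":")[0] for addr in coords.split(",")]

# The stale coordinators from this issue (cluster ID redacted to "xxx"):
stale = ("mdm_foundationdb_ibm:xxx@172.30.230.42:4500:tls,"
         "172.30.226.243:4500:tls,172.30.218.154:4500:tls")
print(coordinator_ips(stale))
# ['172.30.230.42', '172.30.226.243', '172.30.218.154']
```

None of these three IPs appear in the recreated Services above, which is why every coordinator is unreachable.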
Now connecting to the cluster fails:

oc exec -it mdm-foundationdb-ibm-storage-2 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "foundationdb" out of: foundationdb, foundationdb-kubernetes-sidecar, foundationdb-kubernetes-init (init)
bash-4.4$ fdbcli
Using cluster file `/var/dynamic-conf/fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status details

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  172.30.218.154:4500:tls  (unreachable)
  172.30.226.243:4500:tls  (unreachable)
  172.30.230.42:4500:tls  (unreachable)

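This matches FDB's coordination model: the cluster needs a strict majority of coordinators reachable, and here all three coordinator IPs are stale. A tiny illustration of the quorum rule:

```python
def has_quorum(total_coordinators: int, reachable: int) -> bool:
    """FDB coordination requires a strict majority of coordinators reachable."""
    return reachable > total_coordinators // 2

# With 3 coordinators, one stale service IP is tolerated...
print(has_quorum(3, 2))  # True
# ...but two or more stale service IPs lose quorum; the database is unavailable.
print(has_quorum(3, 1))  # False
```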
After the service IPs change, the operator should update fdb.cluster; otherwise FDB becomes unavailable, even though the cluster status still reports healthy, which is incorrect.

 oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep health
  health:
    healthy: true

What did you expect to happen?

fdb.cluster should be updated with the new service IPs.

How can we reproduce it (as minimally and precisely as possible)?

  1. In the CR, set:
    routing:
      publicIPSource: service
  2. Delete the Services (specifically, the coordinator Services).
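To see which Services currently back the coordinators (and therefore which ones to delete for the repro), one illustrative approach is to match the connection string against the Service ClusterIPs. The names, IPs, and cluster ID below are placeholders modeled on the outputs later in this issue:

```python
def coordinator_services(connection_string: str, services: dict[str, str]) -> list[str]:
    """Return names of Services whose ClusterIP appears as a coordinator.

    services maps Service name -> ClusterIP.
    Connection string format: description:ID@ip:port[,ip:port,...]
    """
    _, _, coords = connection_string.partition("@")
    coord_ips = {addr.split(":")[0] for addr in coords.split(",")}
    return sorted(name for name, ip in services.items() if ip in coord_ips)

# Placeholder data:
svcs = {
    "test-cluster-log-5": "172.30.229.194",
    "test-cluster-log-6": "172.30.245.142",
    "test-cluster-log-8": "172.30.90.51",
    "test-cluster-storage-4": "172.30.45.139",
}
conn = "test_cluster:xxx@172.30.90.51:4501,172.30.229.194:4501,172.30.245.142:4501"
print(coordinator_services(conn, svcs))
# ['test-cluster-log-5', 'test-cluster-log-6', 'test-cluster-log-8']
```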

Anything else we need to know?

No response

FDB Kubernetes operator

$ kubectl fdb version
0.48.0

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v0.21.0-beta.1", GitCommit:"88e7eba36ac7cc026d064bb6e4a988acc18a144e", GitTreeState:"clean", BuildDate:"2021-09-21T10:07:38Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5+3afdacb", GitCommit:"3c28e7a79b58e78b4c1dc1ab7e5f6c6c2d3aedd3", GitTreeState:"clean", BuildDate:"2022-05-10T16:30:48Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (0.21) and server (1.23) exceeds the supported minor version skew of +/-1

Cloud provider

OCP

thorn126 avatar Oct 17 '22 07:10 thorn126

Have you tried the same changes with a more recent operator version? Could you share the operator logs from the timespan in which you deleted the services?

Just to clarify: you only deleted the Service but left the Pod running (so only the service IP changed, not the Pod IP)? Is there an actual need for you to delete those services, or is the concern to prevent issues if someone/something accidentally deletes them?

johscheuer avatar Oct 17 '22 08:10 johscheuer

Thank you @johscheuer for your prompt response; I'll try the latest version and attach logs. Yes, when deleting the services the pods were still running; I didn't do anything else. Chances are a service gets deleted accidentally, or the service is unstable and somehow gets lost.

thorn126 avatar Oct 17 '22 08:10 thorn126

I tried the latest operator, v1.9.0, and still hit the same issue. applelog.zip Let me know if you need more info. @johscheuer

thorn126 avatar Oct 17 '22 09:10 thorn126

Just for my understanding: was only a single service recreated, or all at once? If a single service is deleted, the operator should be able to select a new coordinator or update the IP address. If multiple or all services are deleted, this gets a bit more complicated, and the operator doesn't support automatic recovery from that state.

johscheuer avatar Oct 17 '22 09:10 johscheuer

Yes, I deleted the 3 storage services, which are all coordinator services. May I ask: if a single deleted service can be recovered from, is there any reason why multiple services couldn't be recovered similarly? Thanks.

thorn126 avatar Oct 17 '22 11:10 thorn126

Also, the controller log complains "Service \"test-cluster-cluster-controller-1\" not found", even though the storage services were in fact recreated; this message is confusing:

{"level":"error","ts":1665997062.672065,"logger":"controller","msg":"Error in reconciliation","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.addPods","requeueAfter":0,"error":"Service \"test-cluster-cluster-controller-1\" not found","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

thorn126 avatar Oct 17 '22 12:10 thorn126

 oc get svc
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-cluster-cluster-controller-2   ClusterIP   172.30.152.169   <none>        4500/TCP,4501/TCP   3h29m
test-cluster-log-5                  ClusterIP   172.30.13.115    <none>        4500/TCP,4501/TCP   3h29m
test-cluster-log-6                  ClusterIP   172.30.104.156   <none>        4500/TCP,4501/TCP   3h29m
test-cluster-log-7                  ClusterIP   172.30.174.62    <none>        4500/TCP,4501/TCP   3h29m
test-cluster-log-8                  ClusterIP   172.30.56.238    <none>        4500/TCP,4501/TCP   3h29m
test-cluster-storage-4              ClusterIP   172.30.151.4     <none>        4500/TCP,4501/TCP   3h27m
test-cluster-storage-5              ClusterIP   172.30.105.22    <none>        4500/TCP,4501/TCP   3h27m
test-cluster-storage-6              ClusterIP   172.30.189.231   <none>        4500/TCP,4501/TCP   3h27m

oc get po
NAME                                                         READY   STATUS    RESTARTS   AGE
fdb-kubernetes-operator-controller-manager-d5d6ccd5c-kjhwn   1/1     Running   0          3h30m
test-cluster-cluster-controller-2                            2/2     Running   0          3h30m
test-cluster-log-5                                           2/2     Running   0          3h30m
test-cluster-log-6                                           2/2     Running   0          3h30m
test-cluster-log-7                                           2/2     Running   0          3h30m
test-cluster-log-8                                           2/2     Running   0          3h30m
test-cluster-storage-4                                       2/2     Running   0          3h30m
test-cluster-storage-5                                       2/2     Running   0          3h30m
test-cluster-storage-6                                       2/2     Running   0          3h30m

test-cluster-cluster-controller-1 doesn't exist, though; it is test-cluster-cluster-controller-2 now, but the operator is still trying to get the old Service. (BTW, I didn't delete the cluster controller's Service; that was the operator itself.)

thorn126 avatar Oct 17 '22 12:10 thorn126

Actually, I did more testing: even when only one service is deleted, it doesn't recover.

oc get svc
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-cluster-cluster-controller-2   ClusterIP   172.30.182.87    <none>        4500/TCP,4501/TCP   8m5s
test-cluster-log-5                  ClusterIP   172.30.229.194   <none>        4500/TCP,4501/TCP   8m5s
test-cluster-log-6                  ClusterIP   172.30.245.142   <none>        4500/TCP,4501/TCP   8m5s
test-cluster-log-7                  ClusterIP   172.30.255.83    <none>        4500/TCP,4501/TCP   8m5s
test-cluster-log-8                  ClusterIP   172.30.90.51     <none>        4500/TCP,4501/TCP   8m5s
test-cluster-storage-4              ClusterIP   172.30.45.139    <none>        4500/TCP,4501/TCP   8m5s
test-cluster-storage-5              ClusterIP   172.30.181.57    <none>        4500/TCP,4501/TCP   8m5s
test-cluster-storage-6              ClusterIP   172.30.183.175   <none>        4500/TCP,4501/TCP   8m5s

oc get foundationdbcluster test-cluster -o yaml |grep conn
  connectionString: test_cluster:…@172.30.90.51:4501,172.30.229.194:4501,172.30.245.142:4501

delete one coordinator service

oc  delete svc test-cluster-log-8
service "test-cluster-log-8" deleted
# oc get svc
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-cluster-cluster-controller-2   ClusterIP   172.30.182.87    <none>        4500/TCP,4501/TCP   9m40s
test-cluster-log-5                  ClusterIP   172.30.229.194   <none>        4500/TCP,4501/TCP   9m40s
test-cluster-log-6                  ClusterIP   172.30.245.142   <none>        4500/TCP,4501/TCP   9m40s
test-cluster-log-7                  ClusterIP   172.30.255.83    <none>        4500/TCP,4501/TCP   9m40s
test-cluster-log-8                  ClusterIP   172.30.73.139    <none>        4500/TCP,4501/TCP   13s
test-cluster-storage-4              ClusterIP   172.30.45.139    <none>        4500/TCP,4501/TCP   9m40s
test-cluster-storage-5              ClusterIP   172.30.181.57    <none>        4500/TCP,4501/TCP   9m40s
test-cluster-storage-6              ClusterIP   172.30.183.175   <none>        4500/TCP,4501/TCP   9m40s

Check the connection string again:

  connectionString: test_cluster:…@172.30.90.51:4501,172.30.229.194:4501,172.30.245.142:4501

It still contains the old service IP, and nothing in the controller log indicates any change:

{"level":"info","ts":1666010517.5915344,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"testoperator2","cluster":"test-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1666010517.6078048,"logger":"controller","msg":"Reconciliation complete","namespace":"testoperator2","cluster":"test-cluster","generation":1}

If I then delete the remaining coordinator services, so that all 3 coordinator IPs are stale:

oc delete svc test-cluster-log-5 test-cluster-log-6
service "test-cluster-log-5" deleted
service "test-cluster-log-6" deleted
# oc get svc
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-cluster-cluster-controller-2   ClusterIP   172.30.182.87    <none>        4500/TCP,4501/TCP   16m
test-cluster-log-5                  ClusterIP   172.30.84.107    <none>        4500/TCP,4501/TCP   9s
test-cluster-log-7                  ClusterIP   172.30.255.83    <none>        4500/TCP,4501/TCP   16m
test-cluster-log-8                  ClusterIP   172.30.73.139    <none>        4500/TCP,4501/TCP   7m
test-cluster-storage-4              ClusterIP   172.30.45.139    <none>        4500/TCP,4501/TCP   16m
test-cluster-storage-5              ClusterIP   172.30.181.57    <none>        4500/TCP,4501/TCP   16m
test-cluster-storage-6              ClusterIP   172.30.183.175   <none>        4500/TCP,4501/TCP   16m

Checking the controller log, this time there is a timeout, which indicates FDB is unavailable:

{"level":"info","ts":1666010925.5818737,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.excludeProcesses"}
{"level":"info","ts":1666010925.5821764,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"testoperator2","cluster":"test-cluster","subReconciler":"controllers.changeCoordinators"}
{"level":"info","ts":1666010925.582316,"logger":"fdbclient","msg":"Running command","namespace":"testoperator2","cluster":"test-cluster","path":"/usr/bin/fdb/6.3/fdbcli","args":["/usr/bin/fdb/6.3/fdbcli","--exec","option on ACCESS_SYSTEM_KEYS; get \ufffd/coordinators","-C","/tmp/3766994553","--log","--log","--trace_format","xml","--log-dir","/var/log/fdb","--timeout","10"]}
{"level":"error","ts":1666010935.6083379,"logger":"fdbclient","msg":"Error from FDB command","namespace":"testoperator2","cluster":"test-cluster","code":1,"stdout":"\nWARNING: Long delay (Ctrl-C to interrupt)\nSpecified timeout reached -- exiting...\n>>> option on ACCESS_SYSTEM_KEYS\nOption enabled for all transactions\n>>> get \\xff/coordinators\n\nThe database is unavailable; type `status' for more information.\n\n","stderr":"","error":"exit status 1","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).runCommandWithBackoff\n\t/workspace/fdbclient/admin_client.go:261\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).GetConnectionString\n\t/workspace/fdbclient/admin_client.go:632\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.changeCoordinators.reconcile\n\t/workspace/controllers/change_coordinators.go:50\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

thorn126 avatar Oct 17 '22 12:10 thorn126

@johscheuer From the testing: if we delete more than half of the coordinator service IPs (for example, two of three), then running `status details` in fdbcli inside a pod shows the database is unavailable. The apple controller log also indicates the database is unavailable, e.g. "error":"unable to fetch connection string: The database is unavailable; type `status' for more information.\n". But checking the cluster resource:

oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep health
  health:
    healthy: true

oc get foundationdbcluster mdm-foundationdb-ibm -o yaml |grep avai
    available: true

Why is the foundationdbcluster status not correct, and when does it get updated to unavailable?

thorn126 avatar Oct 18 '22 14:10 thorn126

The health status does not change away from true until I restart the apple controller.

thorn126 avatar Oct 18 '22 14:10 thorn126

> Yes, I deleted the 3 storage services, which are all coordinator services. May I ask: if a single deleted service can be recovered from, is there any reason why multiple services couldn't be recovered similarly? Thanks.

Deleting a single service should be fine, and the operator should recover from that state; if that is not the case, there might be a bug in the code. To be precise: deleting all services for the coordinator Pods (in this case the storage Pods) means the operator is no longer able to connect to the database, since the connection string contains the old service IPs. In theory the operator could recover from this case, but it's definitely not well tested.

> Why is the foundationdbcluster status not correct, and when does it get updated to unavailable?

That's probably simply a bug :)

Feel free to look into those issues and create PRs with the respective fixes. We don't use the service implementation ourselves, and the plan going forward is to use DNS instead of service IPs.

johscheuer avatar Oct 19 '22 07:10 johscheuer

Thank you @johscheuer for the explanation. From my tests: if we delete one coordinator service, that one coordinator IP becomes unreachable, but FDB is still available, so the impact is small; after deleting the corresponding pod, all coordinators are reachable again. If we delete more than half of the coordinator service IPs, then restart the FDB cluster pods and run the fix-coordinator-ips command, it actually recovers. As for the incorrect foundationdbcluster status: since you mentioned you don't use the service implementation, we'll check the apple controller log to determine whether FDB is unavailable as a workaround. Closing this issue since I managed to make it work. Thanks for all your help. @johscheuer

thorn126 avatar Oct 20 '22 13:10 thorn126

I believe it's still a bug that the coordinators are not updated when you delete a single service, so feel free to reopen this and provide a fix for it.

johscheuer avatar Oct 21 '22 13:10 johscheuer

@johscheuer If we want to fix it, it should be fixed in the latest version, right? But you mentioned the plan going forward is to use DNS instead of service IPs, and this issue was found when using service IPs. Let me know if you still think it's worth fixing even though it's not the strategic direction, and I'll reopen it. Thanks.

thorn126 avatar Oct 24 '22 00:10 thorn126