fdb-kubernetes-operator
fdb-kubernetes-operator copied to clipboard
FoundationDBBackup does not reconcile or start backup
What happened?
We have configured an FDB cluster using the v1.7.0 operator, this is operating correctly and can be interacted with. The cluster has reconcilled correctly. We have now moved to configure backups, however upon creating the fdbbackup object the operator never finishes reconcilling and the backup is never started.
# kubectl get fdbbackup -n fdb-primary-dev primary-dev-backup 08:58:03
NAME GENERATION RECONCILED AGE
primary-dev-backup 1 23m
The operator does attempt to create a backup.
{"level":"info","ts":1657699169.9587805,"logger":"fdbclient","msg":"Running command","namespace":"fdb-primary-dev","cluster":"primary-dev","path":"/usr/bin/fdb/7.1/fdbbackup","args":["/usr/bin/fdb/7.1/fdbbackup","status","--json","-C","/tmp/1088749738","--log","--log","--logdir","/var/log/fdb"]}
{"level":"info","ts":1657699170.037034,"logger":"fdbclient","msg":"Command completed","namespace":"fdb-primary-dev","cluster":"primary-dev","output":"{\"SchemaVersion\":\"1...."}
{"level":"info","ts":1657699170.0374086,"logger":"fdbclient","msg":"Running command","namespace":"fdb-primary-dev","cluster":"primary-dev","path":"/usr/bin/fdb/7.1/fdbbackup","args":["/usr/bin/fdb/7.1/fdbbackup","start","-d","blobstore://[email protected]/primary-dev-20220713?bucket=foundationdb-backups-dev-XXXXXXXXXX-XXXXXXXXXXX","-s","1200","-z","-C","/tmp/2189325739","--log","--log","--logdir","/var/log/fdb"]}
{"level":"error","ts":1657699180.0770085,"logger":"fdbclient","msg":"Error from FDB command","namespace":"fdb-primary-dev","cluster":"primary-dev","code":-1,"stdout":"","stderr":"","error":"signal: killed","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).StartBackup\n\t/workspace/fdbclient/admin_client.go:595\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.startBackup.reconcile\n\t/workspace/controllers/start_backup.go:45\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBBackupReconciler).Reconcile\n\t/workspace/controllers/backup_controller.go:88\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1657699180.0771945,"logger":"controller","msg":"Error in reconciliation","namespace":"fdb-primary-dev","backup":"primary-dev-backup","subReconciler":"controllers.startBackup","requeueAfter":0,"error":"signal: killed","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBBackupReconciler).Reconcile\n\t/workspace/controllers/backup_controller.go:93\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1657699180.0772355,"logger":"controller-runtime.manager.controller.foundationdbbackup","msg":"Reconciler error","reconciler group":"apps.foundationdb.org","reconciler kind":"FoundationDBBackup","name":"primary-dev-backup","namespace":"fdb-primary-dev","error":"signal: killed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:214"}
If I connect to operator and look for the appropriate logs of that execution, the first obvious error I find is:
<Event Severity="20" Time="1657698700.094137" DateTime="2022-07-13T07:51:40Z" Type="S3BlobStoreEndpointRequestFailedRetryable" ID="0000000000000000" Error="lookup_failed" ErrorDescription="DNS lookup failed" ErrorCode="1041" SuppressedEventCount="0" ConnectionEstablished="0" RemoteHost="s3.eu-west-1.amazonaws.com" Verb="HEAD" Resource="/foundationdb-backups-dev-XXXXXXXXXXX-XXXXXXXXXX" ThisTry="1" RetryDelay="2" ThreadID="3291365605512418571" Machine="10.0.134.86:756" LogGroup="default" ClientDescription="primary-7.1.5-3291365605512418571" />
The only reference to a DNS lookup failed I can find leads me to this forum post on the fdb forum. https://forums.foundationdb.org/t/foundationdb-backup-s3-https-connection-failed/2340
Which suggests it might be a TLS configuration failure - we provide the fdbbackup k8s object with a valid CA chain for aws s3 and the client certificate, that's also provided with its key - though no TLS is required on the cluster at this point.
We can see in the fdb operator logs that the TLS configuration is not provided to fdbbackup.
<Event Severity="10" Time="1657698700.064259" DateTime="2022-07-13T07:51:40Z" Type="Net2TLSConfig" ID="0000000000000000" CAPath="" CertificatePath="" KeyPath="" HasPassword="0" VerifyPeers="Check.Valid=1" ThreadID="3291365605512418571" Machine="10.0.134.86:756" LogGroup="default" ClientDescription="primary-7.1.5-3291365605512418571" />
This might be unrelated though.
If I reconsturct the start command and run it on one of the two backup agent nodes (rather than on the operator), the backup is issued and starts successfully.
fdbbackup start -d blobstore://[email protected]/primary-dev-20220712?bucket=foundationdb-backups-dev-XXXXXXXXXXX-XXXXXXXXX -s 1200 -z --log --log --logdir .
The operator then reconcils and completes.
What did you expect to happen?
The operator to create the backup and reconcil.
How can we reproduce it (as minimally and precisely as possible)?
- Install v1.7.0 fdb-operator, in global mode.
- Create a cluster in its own namespace.
- Wait for reconcilliation.
- Create a backup in the same namespace as the cluster.
Anything else we need to know?
I've attached the following logs and k8s objects.
fdb-operator-new-fdbbackup-logs.txt
fdbackup-start-on-fdb-operator-failure.txt
fdbbackup-start-on-backup-agent-success.txt
FDB Kubernetes operator
$ kubectl fdb version
# paste output here
The above does not work for me, I don't know if I've missed a stage during installation for that to work.
We install the operator using CDK8s charts, from tag v1.7.0.
Operator is running the current image: foundationdb/fdb-kubernetes-operator:v1.7.0
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:36:49Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-eks-a64ea69", GitCommit:"d4336843ba36120e9ed1491fddff5f2fec33eb77", GitTreeState:"clean", BuildDate:"2022-05-12T18:29:27Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
I think those are related: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/1240
Could you try adding the port to your configuration and see if that works.
Thanks for the quick feedback @johscheuer! Adding :443 has certainly pushed us forward, it has however exposed a more obvious TLS error.
<Event Severity="20" Time="1657724985.453345" DateTime="2022-07-13T15:09:45Z" Type="TLSPolicyFailure" ID="0000000000000000" SuppressedEventCount="0" Reason="preverification failed" VerifyError="unable to get local issuer certificate" ThreadID="13551886713154994050" Machine="10.0.134.86:1258" LogGroup="default" ClientDescription="primary-7.1.5-13551886713154994050" />
<Event Severity="20" Time="1657724985.453345" DateTime="2022-07-13T15:09:45Z" Type="N2_ConnectHandshakeError" ID="61d44397d998f368" SuppressedEventCount="0" ErrorCode="337047686" Message="certificate verify failed (SSL routines, tls_process_server_certificate)" WhichMeans="error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed" ThreadID="13551886713154994050" Machine="10.0.134.86:1258" LogGroup="default" ClientDescription="primary-7.1.5-13551886713154994050" />
<Event Severity="20" Time="1657724985.453345" DateTime="2022-07-13T15:09:45Z" Type="S3BlobStoreEndpointRequestFailedRetryable" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" ConnectionEstablished="0" RemoteHost="s3.eu-west-1.amazonaws.com" Verb="HEAD" Resource="/foundationdb-backups-dev-storage8d9329be-vb9bge2m6r1h" ThisTry="1" RetryDelay="2" ThreadID="13551886713154994050" Machine="10.0.134.86:1258" LogGroup="default" ClientDescription="primary-7.1.5-13551886713154994050" />
From further up in the log for the fdbbackup command running on the operator:
<Event Severity="10" Time="1657724985.381516" DateTime="2022-07-13T15:09:45Z" Type="Net2TLSConfig" ID="0000000000000000" CAPath="" CertificatePath="" KeyPath="" HasPassword="0" VerifyPeers="Check.Valid=1" ThreadID="13551886713154994050" Machine="10.0.134.86:1258" LogGroup="default" ClientDescription="primary-7.1.5-13551886713154994050" />
I'm assuming that the CA chain provided to the FoundationDBBackup podTemplate is unavailable on the operator pod at this point? If fdbbackup needs access to the tls certs to start the process, do we need to provide the certs to the operator, or does the operator need to execute fdbbackup start on the backup agents?
Full log if needed: fdbbackup-start-on-fdb-operator-with-443-failed.txt
Currently you have to provide those certs to the operator we have an issue open to restructure that see: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/753 but we currently don't have the time to implement this.
I'll go ahead and close this issue, since we have follow up radars for all the work referenced here. If you still have any issues please let me know and feel free to reopen the issue or create a new issue.
Thanks for reporting the issue!