
EtcdBackup unable to create etcd endpoint.

shebanian opened this issue 5 years ago • 12 comments

I am running a working etcd cluster with Vault on it. Vault is working correctly, and so is the cluster itself.

Name:         vault-etcd
Namespace:    my-namespace
Labels:       app=vault
              vault_cr=vault
Annotations:  <none>
API Version:  etcd.database.coreos.com/v1beta2
Kind:         EtcdCluster
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-08-29T14:04:48Z
  Generation:          0
  Owner References:
    API Version:     vault.banzaicloud.com/v1alpha1
    Controller:      true
    Kind:            Vault
    Name:            vault
    UID:             7c63314b-ab94-11e8-bd5c-0626c6bac6fc
  Resource Version:  4246480
  Self Link:         /apis/etcd.database.coreos.com/v1beta2/namespaces/my-namespace/etcdclusters/vault-etcd
  UID:               7d891555-ab94-11e8-bd5c-0626c6bac6fc
Spec:
  TLS:
    Static:
      Member:
        Peer Secret:    vault-etcd-tls
        Server Secret:  vault-etcd-tls
      Operator Secret:  vault-etcd-tls
  Repository:           quay.io/coreos/etcd
  Size:                 3
  Version:              3.1.15
Status:
  Client Port:  2379
  Conditions:
    Last Transition Time:  2018-08-29T14:05:31Z
    Last Update Time:      2018-08-29T14:05:31Z
    Reason:                Cluster available
    Status:                True
    Type:                  Available
  Current Version:         3.1.15
  Members:
    Ready:
      vault-etcd-475x979hr9
      vault-etcd-fqsvhxrhl4
      vault-etcd-lrdzr5gsqn
  Phase:           Running
  Service Name:    vault-etcd-client
  Size:            3
  Target Version: 
Events:       <none>

I have created an EtcdBackup to make a backup to an S3 bucket, but it keeps failing, and I can't find out why. KubeDNS is working and the endpoint is correct.

Name:         backup-vault-etcd-20180828-1045
Namespace:    my-namespace
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"etcd.database.coreos.com/v1beta2","kind":"EtcdBackup","metadata":{"annotations":{},"name":"backup-vault-etcd-20180828-1045","namespace":...
API Version:  etcd.database.coreos.com/v1beta2
Kind:         EtcdBackup
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-08-30T09:27:39Z
  Generation:          0
  Resource Version:    4244170
  Self Link:           /apis/etcd.database.coreos.com/v1beta2/namespaces/my-namespace/etcdbackups/backup-vault-etcd-20180828-1045
  UID:                 f071b973-ac36-11e8-b2fa-0626c6bac6fc
Spec:
  Client TLS Secret:  vault-etcd-tls
  Etcd Endpoints:
    https://vault-etcd-client:2379
  S 3:
    Aws Secret:  etcd-operator
    Path:        etcd-backups/vault-etcd-20180828-1045
  Storage Type:  S3
Status:
  Reason:     failed to save snapshot (create etcd client failed: failed to get etcd client with maximum kv store revision: could not create an etcd client for the max revision purpose from given endpoints ([https://vault-etcd-client:2379]))
  Succeeded:  false
Events:       <none>

The k8s secret vault-etcd-tls contains everything needed.

Name:         vault-etcd-tls
Namespace:    my-namespace
Labels:       app=vault
              vault_cr=vault-etcd
Annotations:  <none>

Type:  Opaque

Data
====
peer.crt:            1342 bytes
peer.key:            1675 bytes
server.crt:          1330 bytes
server.key:          1679 bytes
etcd-client.crt:     1131 bytes
peer-ca.crt:         1143 bytes
server-ca.crt:       1143 bytes
etcd-client-ca.crt:  1143 bytes
etcd-client.key:     1679 bytes

I think the fact that the backup operator keeps failing is a bug, because I can't find a configuration mistake.

Looking through the error logs and then at the code, I have found at least one improvement for the error logging.

	// If no client could be created for any endpoint, fail immediately;
	// note that the collected per-endpoint errors are never reported here.
	if maxClient == nil {
		return nil, 0, fmt.Errorf("could not create an etcd client for the max revision purpose from given endpoints (%v)", endpoints)
	}

	// Aggregate the per-endpoint error strings (only reached when at
	// least one client was created).
	var err error
	if len(errors) > 0 {
		errorStr := ""
		for _, errStr := range errors {
			errorStr += errStr + "\n"
		}
		err = fmt.Errorf(errorStr)
	}

This should be changed, because the specific errors (failed to create a client, or failed to get its revision) are not printed to the log when no client could be created at all. I think the error aggregation should be moved before the maxClient == nil check.

So I don't know whether it fails to get the revision from the endpoint or fails to create an etcd client.
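
A minimal sketch of the proposed reordering (untested; it keeps the variable names from the snippet above, assumes errors is the []string of per-endpoint failures collected earlier in the function, and needs the strings import):

	// Aggregate the per-endpoint errors first, so the specific failure
	// reasons are available even when no client could be created.
	var err error
	if len(errors) > 0 {
		err = fmt.Errorf("%s", strings.Join(errors, "\n"))
	}

	if maxClient == nil {
		// Surface the aggregated errors instead of only listing the endpoints.
		return nil, 0, fmt.Errorf("could not create an etcd client for the max revision purpose from given endpoints (%v): %v", endpoints, err)
	}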

shebanian avatar Aug 30 '18 14:08 shebanian

I still have this problem. Any thoughts?

shebanian avatar Sep 11 '18 09:09 shebanian

I also have the same issue:

time="2018-09-12T13:12:45Z" level=error msg="error syncing etcd backup (vault/etcd-cluster): failed to save snapshot (create etcd client failed: failed to get etcd client with maximum kv store revision: could not create an etcd client for the max revision purpose from given endpoints ([https://etcd-cluster-client.vault:2379]))" pkg=controller

salkin avatar Sep 12 '18 13:09 salkin

@shebanian in my case I could solve the issue by using .svc in the etcd cluster endpoint URL.

This one generates the fault: https://etcd-cluster-client.vault:2379

But when the endpoint is changed to https://etcd-cluster-client.vault.svc:2379, the backup is successfully saved.
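
For reference, a minimal sketch of the corresponding EtcdBackup spec with the fully qualified endpoint (the resource name and S3 details are placeholders, not a verified config):

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: etcd-cluster-backup
spec:
  etcdEndpoints:
    # fully qualified service name, including the .svc suffix
    - https://etcd-cluster-client.vault.svc:2379
  storageType: S3
  s3:
    path: my-bucket/etcd-cluster.backup  # placeholder
    awsSecret: aws                       # placeholder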

salkin avatar Sep 13 '18 05:09 salkin

@salkin I have tried your solution but it doesn't help. I still keep getting the same error.

shebanian avatar Sep 14 '18 08:09 shebanian

My 2 cents: this seems to work on the clusters created in the namespace by etcd-operator, but doesn't seem to work on my external clusters. I gave my backup CR the endpoints and the AWS credentials, but it still won't take the backup for me. Debugging is much harder simply because I don't know what is actually failing; I keep getting that same error in my logs too.

Side note: can anyone show me how to use etcd-backup-operator to backup external clusters? tyvm

erasmus74 avatar Oct 05 '18 18:10 erasmus74

Hi, I have also faced a similar issue. It seems you have a problem connecting to etcd. You need to configure the etcd certificate with the same names they have suggested.
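
If the naming requirement refers to the keys inside the client TLS secret, the secret listed earlier in this thread uses etcd-client.crt, etcd-client-ca.crt, and etcd-client.key. A sketch of creating a secret with those key names (the secret name and file paths are placeholders):

kubectl create secret generic etcd-client-tls \
  --from-file=etcd-client.crt=client.crt \
  --from-file=etcd-client-ca.crt=client-ca.crt \
  --from-file=etcd-client.key=client.key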

kannanvr avatar Dec 19 '18 06:12 kannanvr

Hi guys, I've also hit this issue, and the solution is to add

ClientTLSSecret: vault-cluster-etcd-client-tls

The ClientTLSSecret value must exactly match the secret name shown by kubectl get secret.

apiVersion: "etcd.database.coreos.com/v1beta2" kind: "EtcdBackup" metadata: name: etcd-cluster-backup spec: etcdEndpoints: ["https://vault-cluster-etcd-client:2379"] ClientTLSSecret: vault-cluster-etcd-client-tls storageType: S3 s3: path: vault-backup-bucket/TH/MGT/Openshift/non-production-cluster-1.bkp awsSecret: aws

Enjoy!

selfieblue avatar Dec 22 '18 07:12 selfieblue

I have this error also, without using SSL (for now):

  "Reason": "failed to save snapshot (create etcd client failed: failed to get etcd client with maximum kv store revision: could not create an etcd client for the max revision purpose from given endpoints ([http://vault-etcd-cluster-client.secrets:2379]))",
  "etcdRevision": 2822,
  "etcdVersion": "3.3.12",
  "lastSuccessDate": "2019-03-21T22:46:39Z",
  "succeeded": false

But I am still getting backup files in the bucket and I see the following log in the operator console:

amazing-dog-etcd-operator-etcd-backup-operator-5c5fbdbcb8-968zr etcd-backup-operator 2019-03-21T22:46:39.208425881Z time="2019-03-21T22:46:39Z" level=info msg="getMaxRev: endpoint http://vault-etcd-cluster-client.secrets:2379 revision (2822)"

jurgenweber avatar Mar 21 '19 23:03 jurgenweber

Seeing the same issue with backup operator 0.9.4:

  • no TLS
  • operator and cluster reside in separate namespaces
  • the backup CR is configured in the operator's namespace and connects to the service FQDN http://test.etcd.svc.cluster.local:2379

Seconds after the following log line appears, which indicates that the connection was successful, I can see for a brief moment that the backup succeeded (the S3 bucket also gets updated):

time="2019-05-02T08:56:54Z" level=info msg="getMaxRev: endpoint http://test-client.etcd.svc.cluster.local:2379 revision (4)"

After a couple of seconds (significantly less than the backup interval), though, the backup status changes to failed with the following reason:

failed to save snapshot (create etcd client failed: failed to get etcd client with maximum kv store revision: could not create an etcd client for the max revision purpose from given endpoints ([http://test-client.etcd.svc.cluster.local:2379]))

Restarting the backup operator pod fixes the issue.

The steps to reproduce this behaviour seem to be these:

  • Configure wrong S3 credentials and wait for the backup to fail.
  • Restore the correct configuration and use something like watch -n 1 -d kubectl describe etcdbackup to observe the faulty behaviour.
  • Restart the backup operator pod with kubectl delete pod and watch the issue disappear.

alex-goncharov avatar May 02 '19 09:05 alex-goncharov

how to use etcd-backup-operator to backup external clusters

Did you fix it? I have the same situation as you.

zhangsimingshannonai avatar Sep 20 '19 10:09 zhangsimingshannonai

I can confirm that @selfieblue's solution works, although the field name he provided is wrong.

Use this etcd-backup.yml as a reference:

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: gcs-vault-backup
spec:
  etcdEndpoints:
    - https://vault-etcd-client:2379
  clientTLSSecret: vault-etcd-client-tls
  storageType: GCS
  backupPolicy:
    backupIntervalInSecond: 3600
    maxBackups: 48
  gcs:
    path: my-bucket-name/vault.backup
    gcpSecret: gcs-vault-credentials

Adding

clientTLSSecret: vault-etcd-client-tls

fixes the issue. Make sure to use the TLS secret of your etcd client (kubectl get secrets | grep client-tls) and try again. A successful backup will result in the following:

$ kubectl logs -f deployments/etcd-operator etcd-backup-operator
time="2019-10-10T07:20:31Z" level=info msg="getMaxRev: endpoint https://vault-etcd-client:2379 revision (1978)"

Note that if you've done everything correctly, you should see the Client TLS Secret reference when you execute this command:

$ kubectl describe etcdbackups.etcd.database.coreos.com gcs-vault-backup

Name:         gcs-vault-backup
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"etcd.database.coreos.com/v1beta2","kind":"EtcdBackup","metadata":{"annotations":{},"name":"gcs-vault-backup","namespa...
API Version:  etcd.database.coreos.com/v1beta2
Kind:         EtcdBackup
Metadata:
  Creation Timestamp:  2019-10-10T07:19:02Z
  Finalizers:
    backup-operator-periodic
  Generation:        33
  Resource Version:  189086934
  Self Link:         /apis/etcd.database.coreos.com/v1beta2/namespaces/default/etcdbackups/gcs-vault-backup
  UID:               158ff07d-86e2-4cfa-b7ac-618d25662cf7
Spec:
  Backup Policy:
    Backup Interval In Second:  3600
    Max Backups:                48
  Client TLS Secret:            vault-etcd-client-tls
  Etcd Endpoints:
    https://vault-etcd-client:2379
  Gcs:
    Gcp Secret:  gcs-vault-credentials
    Path:        my-bucket-name/vault.backup
  Storage Type:  GCS

denysvitali avatar Oct 10 '19 07:10 denysvitali

I was also able to remedy this issue by following @alex-goncharov's recommendation. All my configs were correct, though I had been monkeying around with the AWS secrets for a bit before getting them right. With everything back in alignment, it still kept failing, but it turns out I needed to delete-restart the etcd-operator-etcd-backup pod (which spawns a new one via the deployment) and then delete and recreate the etcdbackup custom resource. Doing those two things was all I needed to get it working again. Thanks for the pro tip, Alex.
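
For reference, those two recovery steps as commands (a sketch; the pod and resource names are placeholders, adjust them to your deployment):

# delete the backup operator pod; the Deployment spawns a replacement
kubectl delete pod <etcd-operator-etcd-backup-pod>

# delete and recreate the EtcdBackup custom resource
kubectl delete etcdbackup <backup-name>
kubectl apply -f <backup-manifest>.yaml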

johndietz avatar Feb 21 '20 17:02 johndietz