
[question] backup and recovery

Open brzyangg opened this issue 3 years ago • 10 comments

Is there a backup and recovery plan? How can I save my cluster data and recover it quickly in the event of a failure?

brzyangg avatar Jan 04 '22 08:01 brzyangg

By default, Easegress stores cluster data in etcd on your local disk. When restarted, Easegress automatically loads the stored data (YAML files) and recreates the corresponding objects (HTTP servers, HTTP pipelines) for you. Is this what you mean?

suchen-sci avatar Jan 04 '22 09:01 suchen-sci

What I mean is that the data on my local disk may be lost due to some unpredictable problem. I would like to save a snapshot of the data regularly, so that I can recover as soon as possible in the event of data loss.

brzyangg avatar Jan 04 '22 12:01 brzyangg

@brzyangg Thank you for the question. Here are brief instructions for backup and recovery:

Easegress YAML configurations

When developing with Easegress, the best practice is to store all YAML configurations in a VCS such as git. It is especially important to add the Easegress server configuration to git, as you will need it when recovering from a backup. Please note that if you start Easegress with default parameters (e.g. ./easegress-server with no flags), there is no server configuration to store. Still, it is best to add all object definitions (HTTPServer, HTTPPipeline, filters, etc.) to git to ensure the best developer experience.
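As a rough sketch of that practice (repository layout and file names here are purely illustrative):

```bash
# Keep the server configuration and all object definitions under version control
git init easegress-config && cd easegress-config
cp /path/to/config.yaml .             # easegress-server configuration, if you use one
cp /path/to/easegress-objects.yaml .  # HTTPServer, HTTPPipeline, filter definitions
git add .
git commit -m "Back up Easegress server configuration and object definitions"
```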

When the object definition YAML files are stored elsewhere (e.g. in git), disaster recovery becomes very simple. Just start Easegress (single node or cluster) with a fresh disk (data-dir and member-dir empty) and re-create all objects using egctl object create -f easegress-objects.yaml, where easegress-objects.yaml contains all objects.
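A minimal sketch of that recovery flow, assuming config.yaml is your server configuration and easegress-objects.yaml holds the object definitions from git:

```bash
# Start Easegress on a fresh disk: data-dir and member-dir must be empty or absent
rm -rf ./data ./member            # only on a node you are intentionally rebuilding
./easegress-server -f config.yaml &

# Re-create all objects from the YAML stored in version control
egctl object create -f easegress-objects.yaml
```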

If you are using AutoCertManager to automatically manage TLS certificates with Let's Encrypt, AutoCertManager will renew the certificates after the disaster once Easegress is restarted. You don't need to back up these certificates elsewhere.

In most cases, recreating Easegress without data and recreating the objects is sufficient for disaster recovery. However, if you still need to create backups of the data and use those backups to recover after a disaster, here are the different cases to consider.

Backup single-node Easegress

You can periodically back up the data-dir folder that Easegress creates at startup. The default path is ./data, but you might have a different value in your Easegress configuration.
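For example, a minimal periodic backup might look like this (paths and schedule are assumptions; point it at your actual data-dir):

```bash
# Archive the data-dir (default ./data) with a timestamp; run this from cron
# and ship the archive off the node (object storage, another host, etc.)
tar czf "easegress-data-$(date +%Y%m%d%H%M%S).tar.gz" ./data
```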

Recover single-node Easegress

Copy the backed-up data-dir directory to a suitable location on your node and start the Easegress server with the same configuration as the initial instance. Ensure that data-dir (by default ./data) points to the restored backup folder.
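Continuing the sketch above (archive name and paths are illustrative):

```bash
# Restore the backed-up data-dir, then start Easegress with the same configuration
tar xzf easegress-data-20220101120000.tar.gz   # recreates ./data
./easegress-server -f config.yaml              # same configuration as the initial instance
```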

Backup multi-node Easegress

If you have multiple Easegress nodes running in cluster mode (either cluster-join-urls or cluster.initial-cluster defined in the Easegress configuration), there are two variants to consider:

Using cluster-join-urls

Easegress v1.4.0 introduced a new, simplified way to configure and manage the cluster using the cluster.initial-cluster YAML entry. Please consider updating your configuration to this new way, following this cookbook. If you don't want to update right now, here's how you should back up your Easegress cluster: periodically save a backup of the data-dir (default value ./data) and member-dir (default value ./member) of each node. So for a cluster of size N, you need to back up N * 2 folders.
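A sketch of that per-node backup, assuming the default ./data and ./member locations:

```bash
# Run on every node of the cluster; for N nodes this yields N * 2 backed-up folders
tar czf "easegress-$(hostname)-$(date +%Y%m%d).tar.gz" ./data ./member
```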

Using cluster.initial-cluster

You only need to back up the data-dir (default value ./data). You can export it periodically from one node, from each node, or you can alternate the node to back up each time, for example.
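For example, on whichever node you pick for a given round (default ./data assumed):

```bash
# cluster.initial-cluster mode: the data-dir of a single node is enough
tar czf "easegress-data-$(date +%Y%m%d).tar.gz" ./data
```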

Restore multi-node Easegress

Let's consider again the two variants, depending on whether you are using cluster-join-urls or cluster.initial-cluster in your Easegress configuration.

Using cluster-join-urls

Restore the corresponding data-dir and member-dir of each node and restart Easegress with the initial configurations. Please ensure that the restored data-dir and member-dir are in the locations defined in your configuration.
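Roughly, on each node (the archive name is illustrative; use the data-dir/member-dir locations from your configuration):

```bash
# Restore this node's own backup, then restart with the node's initial configuration
tar xzf "easegress-$(hostname)-20220101.tar.gz"   # recreates ./data and ./member
./easegress-server -f config.yaml
```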

Using cluster.initial-cluster

Restore the latest data-dir to only one node and restart Easegress on this node using easegress-server -f config.yaml --state-flag "existing", where config.yaml is the initial configuration. Please ensure that the restored data-dir is in the location defined in your configuration. For the rest of the nodes, ensure that the data-dir location does not exist and restart Easegress using easegress-server -f config.yaml --state-flag "existing". The Easegress state is synchronized automatically to all nodes.
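A sketch of both cases, where config.yaml is the initial configuration mentioned above and the archive name is illustrative:

```bash
# On the single node that receives the restored data-dir:
tar xzf easegress-data-20220101.tar.gz                    # recreates ./data
easegress-server -f config.yaml --state-flag "existing"

# On every other node: make sure data-dir does not exist, then start the same way;
# the state is synchronized to these nodes automatically.
rm -rf ./data
easegress-server -f config.yaml --state-flag "existing"
```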

Note on copying data-dir

When copying data-dir and member-dir, Easegress might be writing to these files at the same time, so copying while Easegress is running might cause data corruption. To avoid this issue, you can first shut down the Easegress process of the node you want to back up and then copy the data folder. With 3 or more nodes (an odd number), it is safe to shut down Easegress on one node before the backup and restart it after the backup is done.
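For instance, on a host where Easegress runs as a systemd service (the service name easegress below is only an assumption about your setup):

```bash
# Stop Easegress on this node, copy the data while nothing is writing to it, restart
sudo systemctl stop easegress
mkdir -p "/backup/easegress-$(hostname)"
cp -a ./data ./member "/backup/easegress-$(hostname)/"
sudo systemctl start easegress
```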

samutamm avatar Jan 05 '22 00:01 samutamm

@samutamm Thank you so much. I will try this.

brzyangg avatar Jan 05 '22 03:01 brzyangg

@brzyangg Have you had a chance to try these instructions in your environment?

samutamm avatar Jan 14 '22 06:01 samutamm

@samutamm Hi, thank you for your support. I have some problems in my environment. My Easegress uses cluster-join-urls and is deployed in Kubernetes as a StatefulSet with three pods. I want to simulate a failure and recover.

First, I copied and saved the data-dir and member-dir of the three pods. Then I deleted the data-dir and member-dir of all pods, so the cluster became unavailable. I copied the data-dir and member-dir back in as you suggested. After these operations, the cluster cannot recover.

These are the logs of my nodes.

log for pod-0

2022-01-18T15:39:38.932+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:39:38.933+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:39:38.933+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:39:38.934+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:40:38.935+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:40:38.935+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:40:38.94+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
2022-01-18T15:41:43.941+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:41:43.941+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:41:43.944+08:00 ERROR cluster/cluster.go:203 failed start many times(2), start others if they're not online, otherwise purge this member, clean data directory and rejoin it back.
2022-01-18T15:42:40.547+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:42:40.547+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:42:40.548+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:42:40.548+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:43:40.549+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:43:40.549+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:43:40.553+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
2022-01-18T15:44:45.554+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:44:45.555+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:44:45.558+08:00 ERROR cluster/cluster.go:203 failed start many times(2), start others if they're not online, otherwise purge this member, clean data directory and rejoin it back.
2022-01-18T15:45:50.558+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context deadline exceeded
2022-01-18T15:45:50.558+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:45:50.561+08:00 ERROR cluster/cluster.go:203 failed start many times(3), start others if they're not online, otherwise purge this member, clean data directory and rejoin it back.
2022-01-18T15:46:55.562+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled

log for pod-1

2022-01-18T15:39:52.499+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:39:52.499+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:39:52.499+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:39:52.5+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:40:52.5+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context deadline exceeded
2022-01-18T15:40:52.5+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:50:53.745+08:00 INFO cluster/cluster.go:622 hard stop server
2022-01-18T15:50:53.745+08:00 ERROR cluster/cluster.go:267 start server timeout(10m0s)
2022-01-18T15:50:56.441+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:50:56.441+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:50:56.442+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:50:56.443+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:51:56.444+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:51:56.444+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T16:01:57.83+08:00 INFO cluster/cluster.go:622 hard stop server
2022-01-18T16:01:57.831+08:00 ERROR cluster/cluster.go:267 start server timeout(10m0s)

log for pod-2

2022-01-18T15:40:59.968+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:40:59.97+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:40:59.97+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:40:59.971+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:41:59.971+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:41:59.972+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:41:59.976+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
2022-01-18T15:42:48.931+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:42:48.932+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:42:48.932+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:42:48.932+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:43:48.933+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:43:48.933+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:43:48.936+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
2022-01-18T15:44:53.938+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled

I'm not sure what the problem might be; please help.

brzyangg avatar Jan 18 '22 08:01 brzyangg

@brzyangg I noticed the line 2022-01-18T15:43:40.553+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch in the logs of pod-0, and a similar one for pod-2. It looks like the content of data-dir is corrupted. I was able to reproduce the same error and cluster failure by modifying the number of lines in data/member/wal/0000000000000000-0000000000000000.wal (you might have different names for the .wal file). It seems that the .wal files can get corrupted, at least due to some disk problems: https://github.com/etcd-io/etcd/issues/11488
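For reference, one way to provoke the same wal: crc mismatch on a throwaway instance is to damage the WAL file by hand (do not run this against real data; the file name is whatever a fresh instance created):

```bash
# Chop bytes off the end of the WAL file so its CRC no longer matches on the next start
truncate -s -100 data/member/wal/0000000000000000-0000000000000000.wal
```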

samutamm avatar Jan 19 '22 01:01 samutamm

@samutamm thank you. Let's check further.

brzyangg avatar Jan 19 '22 02:01 brzyangg

@brzyangg The data corruption might happen if Easegress was writing to the data at the same time the backup process copied it, leaving the data in an inconsistent state.

You can consider creating the backup for each node one by one, shutting down the Easegress instance of the current node before copying its data. After the copy, restart the Easegress instance. This should avoid the data inconsistency issue.

As an alternative, depending on your use case, you can also store all YAML configurations elsewhere (git, for example) and, in case of disaster and recovery, just start the cluster with no data at all. Then re-create all objects (HTTPServer, HTTPPipeline, filters...) using the egctl tool.
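For example, if each object definition sits in its own YAML file in your repository (the directory layout is just an assumption):

```bash
# After starting the cluster with empty data directories, re-create every object
for f in easegress-config/objects/*.yaml; do
  egctl object create -f "$f"
done
```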

I hope this helps ^^

samutamm avatar Jan 19 '22 07:01 samutamm

@samutamm I have no problem using your method of copying the data on a single machine, but in Kubernetes it is easy to run into that problem, given the possible causes of data inconsistency. We will try a few more approaches. If the operation turns out to be too complicated in Kubernetes, we will use the egctl approach you described for all objects. Thank you very much for your support!

brzyangg avatar Jan 19 '22 07:01 brzyangg