[question] backup and recovery
Is there a backup and recovery plan? How can I save my cluster data and recover it quickly in the event of a failure?
By default, Easegress stores cluster data in etcd on your local disk. On restart, Easegress automatically loads the stored data (YAML files) and re-creates the corresponding objects (HTTP servers, HTTP pipelines) for you. Is this what you mean?
What I mean is that the data on my local disk may be lost due to some unpredictable problems. I would like to save a snapshot of the data regularly, so that I can recover as quickly as possible in the event of data loss.
@brzyangg Thank you for the question. Here are brief instructions for backup and recovery:
Easegress YAML configurations
When developing with Easegress, the best practice is to store all YAML configurations in a VCS such as git. It is especially important to add the Easegress server configuration to git, as you will need it when recovering from a backup. Please note that if you start Easegress with default parameters (e.g. ./easegress-server with no flags), there is no configuration to store. Still, it's best to add all object definitions (HTTPServer, HTTPPipeline, filters etc.) to git to ensure the best developer experience.
When the object definition YAML files are stored elsewhere (e.g. git), disaster recovery becomes very simple. Just start Easegress (single node or cluster) with a fresh disk (data-dir and member-dir are empty) and re-create all objects using egctl object create -f easegress-objects.yaml, where easegress-objects.yaml contains all objects.
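As a minimal sketch, assuming the server configuration lives in config.yaml and all object definitions are kept in easegress-objects.yaml in git (both file names are illustrative), the recovery could look like this:

# start a fresh instance; config.yaml should point data-dir and member-dir at empty locations
./easegress-server -f config.yaml &

# once the admin API is up, re-create every object definition stored in git
egctl object create -f easegress-objects.yaml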
If you are using AutoCertManager to automatically update TLS certificates via Let's Encrypt, AutoCertManager will renew the certificates after a disaster once Easegress is restarted. You don't need to back up these certificates elsewhere.
In most cases, re-creating Easegress without data and re-creating the objects is sufficient for disaster recovery. However, if you still need to create backups of the data and use those backups to recover after a disaster, here are the different cases to consider.
Backup single node Easegress
You can periodically back up the data-dir folder that Easegress creates at startup. The default path is ./data, but you might have a different value in your Easegress configuration.
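A minimal backup sketch, assuming the default ./data path and a /backup directory with enough space (both are assumptions):

# run periodically, e.g. from cron
ts=$(date +%Y%m%d-%H%M%S)
tar czf "/backup/easegress-data-$ts.tar.gz" -C . data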
Recover single node Easegress
Copy the backed-up data-dir directory to a suitable location on your node and start the Easegress server with the same configuration as the initial instance. Ensure that the data-dir (by default ./data) points to the restored backup folder.
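A matching restore sketch, assuming the archive created above and that config.yaml expects data-dir at ./data (archive name and paths are illustrative):

# unpack the backup so that ./data is restored next to the server binary
tar xzf /backup/easegress-data-20220118-1500.tar.gz -C .

# start with the same configuration as the initial instance
./easegress-server -f config.yaml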
Backup multi-node Easegress
If you have multiple Easegress nodes running in cluster mode (either cluster-join-urls or cluster.initial-cluster defined in the Easegress configuration), there are two variants to consider:
Using cluster-join-urls
Easegress v1.4.0 introduced a new, simplified way to configure and manage the cluster using the cluster.initial-cluster YAML entry. Please consider updating your configuration to this new way, following this cookbook.
If you don't want to update right now, here's how you should back up your Easegress cluster: periodically save a backup of the data-dir (default value ./data) and member-dir (default value ./member) of each node. So for a cluster of size N you need to back up N * 2 folders.
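A periodic backup sketch for a 3-node cluster of this kind, assuming SSH access to the nodes, an Easegress working directory of /opt/easegress on each node, and a /backup directory on the machine running the script (hostnames and paths are assumptions):

ts=$(date +%Y%m%d-%H%M%S)
for node in eg-node-0 eg-node-1 eg-node-2; do
  # archive both data-dir and member-dir of the node
  ssh "$node" "tar czf - -C /opt/easegress data member" > "/backup/$node-$ts.tar.gz"
done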
Using cluster.initial-cluster
You only need to back up the data-dir (default value ./data). You can export it periodically from one node, from each node, or you can alternate the node to back up each time, for example.
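For example, a sketch that exports the data-dir from a single node (the hostname and paths are assumptions):

ts=$(date +%Y%m%d-%H%M%S)
# only data-dir is needed in this variant
ssh eg-node-0 "tar czf - -C /opt/easegress data" > "/backup/easegress-data-$ts.tar.gz"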
Restore multi-node Easegress
Let's consider again the two variants, depending on whether you are using cluster-join-urls or cluster.initial-cluster in your Easegress configuration.
Using cluster-join-urls
Restore the corresponding data-dir and member-dir of each node and restart Easegress with the initial configurations. Please ensure that the restored data-dir and member-dir are in the locations defined in your configuration.
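A restore sketch for this variant, under the same assumptions as the backup sketch above (hostnames, paths and the use of nohup to restart the process are illustrative):

for node in eg-node-0 eg-node-1 eg-node-2; do
  # restore data/ and member/ into the locations the configuration expects
  ssh "$node" "tar xzf - -C /opt/easegress" < "/backup/$node-latest.tar.gz"
  # restart Easegress with its initial configuration
  ssh "$node" "cd /opt/easegress && nohup ./easegress-server -f config.yaml >/dev/null 2>&1 &"
done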
Using cluster.initial-cluster
Restore the latest data-dir to only one node and restart Easegress on this node using easegress-server -f config.yaml --state-flag "existing", where config.yaml is the initial configuration. Please ensure that the restored data-dir is in the location defined in your configuration. For the rest of the nodes, ensure that the data-dir location does not exist and restart Easegress using easegress-server -f config.yaml --state-flag "existing". Easegress state is synchronized automatically to all nodes.
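A sketch of this procedure for a 3-node cluster, assuming the backup archive and /opt/easegress paths used in the earlier sketches (the --state-flag "existing" restart comes from the instructions above; everything else is illustrative):

# on the single node that receives the backup
tar xzf /backup/easegress-data-latest.tar.gz -C /opt/easegress   # restores data/
cd /opt/easegress && ./easegress-server -f config.yaml --state-flag "existing"

# on every other node: make sure data-dir does not exist, then restart
rm -rf /opt/easegress/data
cd /opt/easegress && ./easegress-server -f config.yaml --state-flag "existing"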
Note on copying data-dir
When copying data-dir and member-dir, Easegress might be writing to these files at that very moment. Copying while Easegress is running can therefore cause data corruption. To avoid this issue, you can first shut down the Easegress process on the node you want to back up and then copy the data folder. With 3 nodes or more (an odd number), it is safe to shut down Easegress on one node before the backup and restart it after the backup is done.
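A per-node sketch of this stop-copy-restart approach, assuming Easegress runs as a systemd unit named easegress and uses /opt/easegress as its working directory (both are assumptions; adapt it to however you supervise the process):

sudo systemctl stop easegress        # take only this node out of the cluster
tar czf "/backup/easegress-$(hostname)-$(date +%Y%m%d).tar.gz" -C /opt/easegress data member
sudo systemctl start easegress       # rejoin the cluster once the copy is done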
@samutamm Thank you so much. I will try this approach.
@brzyangg Have you had a chance to try these instructions in your environment?
@samutamm Hi, thank you for your support. I have run into some problems in my environment.
My Easegress cluster uses cluster-join-urls and is deployed in Kubernetes as a StatefulSet with three pods.
I want to simulate a failure and recover.
First, I copied and saved the data-dir and member-dir of all three pods.
Then I deleted the data-dir and member-dir of all pods; at that point the cluster was unavailable.
Finally, I copied the data-dir and member-dir back in, as you suggested.
After these operations, the cluster could not recover.
Here are the logs of my nodes.
Log for pod-0:
2022-01-18T15:39:38.932+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:39:38.933+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:39:38.933+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:39:38.934+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:40:38.935+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:40:38.935+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:40:38.94+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
2022-01-18T15:41:43.941+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:41:43.941+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:41:43.944+08:00 ERROR cluster/cluster.go:203 failed start many times(2), start others if they're not online, otherwise purge this member, clean data directory and rejoin it back.
2022-01-18T15:42:40.547+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:42:40.547+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:42:40.548+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:42:40.548+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:43:40.549+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:43:40.549+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:43:40.553+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
2022-01-18T15:44:45.554+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:44:45.555+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:44:45.558+08:00 ERROR cluster/cluster.go:203 failed start many times(2), start others if they're not online, otherwise purge this member, clean data directory and rejoin it back.
2022-01-18T15:45:50.558+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context deadline exceeded
2022-01-18T15:45:50.558+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:45:50.561+08:00 ERROR cluster/cluster.go:203 failed start many times(3), start others if they're not online, otherwise purge this member, clean data directory and rejoin it back.
2022-01-18T15:46:55.562+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
Log for pod-1:
2022-01-18T15:39:52.499+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:39:52.499+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:39:52.499+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:39:52.5+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:40:52.5+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context deadline exceeded
2022-01-18T15:40:52.5+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:50:53.745+08:00 INFO cluster/cluster.go:622 hard stop server
2022-01-18T15:50:53.745+08:00 ERROR cluster/cluster.go:267 start server timeout(10m0s)
2022-01-18T15:50:56.441+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:50:56.441+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:50:56.442+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:50:56.443+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:51:56.444+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:51:56.444+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T16:01:57.83+08:00 INFO cluster/cluster.go:622 hard stop server
2022-01-18T16:01:57.831+08:00 ERROR cluster/cluster.go:267 start server timeout(10m0s)
Log for pod-2:
2022-01-18T15:40:59.968+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:40:59.97+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:40:59.97+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:40:59.971+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:41:59.971+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:41:59.972+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:41:59.976+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
2022-01-18T15:42:48.931+08:00 INFO server/main.go:58 Easegress release: UNKNOWN, repo: UNKNOWN, commit: UNKNOWN
2022-01-18T15:42:48.932+08:00 INFO profile/profile.go:76 cpu profile: /data/palgress/cpu.profile
2022-01-18T15:42:48.932+08:00 INFO cluster/cluster.go:389 client connect with endpoints: [http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380 http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380]
2022-01-18T15:42:48.932+08:00 INFO cluster/cluster.go:402 client is ready
2022-01-18T15:43:48.933+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
2022-01-18T15:43:48.933+08:00 INFO cluster/config.go:126 etcd config: init-cluster:test-snapshot-palgress-0=http://test-snapshot-palgress-0.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-1=http://test-snapshot-palgress-1.test-snapshot-palgress.service.svc.cluster.local:2380,test-snapshot-palgress-2=http://test-snapshot-palgress-2.test-snapshot-palgress.service.svc.cluster.local:2380 cluster-state:existing force-new-cluster:false
2022-01-18T15:43:48.936+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
2022-01-18T15:44:53.938+08:00 ERROR cluster/cluster.go:249 add self to cluster failed: context canceled
I'm not sure what the problem might be, please help.
@brzyangg I noticed the line
2022-01-18T15:43:40.553+08:00 ERROR cluster/cluster.go:197 start cluster failed (1 retries): start server failed: wal: crc mismatch
in the logs of pod-0, and a similar one for pod-2. It looks like the content of the data-dir is corrupted. I was able to reproduce the same error and cluster failure by modifying the number of lines in data/member/wal/0000000000000000-0000000000000000.wal (you might have different names for the .wal file). It seems that the .wal files can get corrupted, at least due to some disk problems: https://github.com/etcd-io/etcd/issues/11488
@samutamm Thank you. I'll check further.
@brzyangg The data corruption might happen if Easegress was writing to the data at the same time the backup process copied it, leaving the data in an inconsistent state.
You can consider creating the backup for each node one by one, shutting down the Easegress instance on the current node before copying its data. After the copy, restart the Easegress instance. This should avoid the data inconsistency issue.
As an alternative, depending on your use case, you can also store all YAML configurations elsewhere (git, for example) and, in case of disaster and recovery, just start the cluster with no data at all. Then re-create all objects (HTTPServer, HTTPPipeline, filters...) using the egctl tool.
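A minimal sketch of that flow; it assumes egctl can reach the cluster and that egctl object list prints all object definitions as YAML, which you should verify against your Easegress version:

# periodic export of all object definitions into version control
egctl object list > easegress-objects.yaml
git add easegress-objects.yaml && git commit -m "backup Easegress objects"

# disaster recovery: start a fresh cluster, then re-create the objects
egctl object create -f easegress-objects.yaml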
I hope this helps ^^
@samutamm I have no problem using your method of copying data on a single machine, but in Kubernetes it is easy to run into that problem, given the possible causes of data inconsistency. I will try a few more approaches. If the operation turns out to be too complicated in Kubernetes, we will use egctl, as you suggested, to export all objects. Thank you very much for your support!