backup-utils
Built in support for warm DR standby
Setting up a warm standby VM is currently fairly straightforward once basic backups are in place, but it could benefit from built-in support in github/backup-utils. The basic idea is to configure a new VM (possibly in another DC) and leave it in maintenance mode. Then modify the scheduled backup run to use the following instead of just ghe-backup:
ghe-backup && ghe-restore <standby-ip>
The Git backup and restore portions are fully incremental so this should be efficient enough to schedule as regularly as every hour, bringing the RPO down to an acceptable level. The RTO could be as low as minutes with a warm standby VM, or could be ~1 hour if people prefer to opt for a cold / provision-VM-at-time-of-recovery setup. Both options should be available and the choice will be based on how much cash / operational work people want to take on vs. optimizing the RPO/RTO. No changes to the current 11.10.343 release are necessary for this.
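A scheduled run along these lines could be wrapped as follows. This is only a sketch, not part of backup-utils: the `BACKUP_CMD`, `RESTORE_CMD`, and `LOCK_DIR` variables and the `run_cycle` function are hypothetical names introduced here, and the lock is an assumption about wanting to avoid overlapping hourly runs.

```shell
#!/bin/sh
# Sketch of one backup-then-restore cycle for a scheduler such as cron.
# BACKUP_CMD, RESTORE_CMD and LOCK_DIR are hypothetical knobs added for
# illustration; they are not part of backup-utils.
BACKUP_CMD="${BACKUP_CMD:-ghe-backup}"
RESTORE_CMD="${RESTORE_CMD:-ghe-restore}"
LOCK_DIR="${LOCK_DIR:-/tmp/ghe-standby-sync.lock}"

run_cycle() {
  standby="$1"

  # mkdir is atomic, so it doubles as a simple lock: if the previous run
  # is still going, skip this one instead of overlapping with it.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "previous cycle still running; skipping" >&2
    return 75
  fi

  status=0
  # Only restore when the backup succeeded, mirroring
  # `ghe-backup && ghe-restore <standby-ip>`.
  if "$BACKUP_CMD"; then
    "$RESTORE_CMD" "$standby" || status=$?
  else
    status=$?
    echo "backup failed with status $status; standby left untouched" >&2
  fi

  rmdir "$LOCK_DIR"
  return "$status"
}
```

To use it from a scheduler, append `run_cycle "$1"` to the script and pass the standby host as the argument. A run that is killed mid-cycle leaves the lock directory behind, so it would need to be removed by hand before the next cycle can start.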
I ran through the basic process of setting up a warm standby VM yesterday and wanted to document the process. Let's assume the primary GHE instance is at "github.example.com". The process for setting up a standby is:
- Run `ghe-backup` against the primary to get a first successful snapshot.
- Boot a new 11.10.320 VM to act as the standby and record the `<standby-ip>`.
- Create a DNS entry: "github-standby.example.com" pointed to `<standby-ip>`. This should be set with a low TTL (like 5 minutes). The main "github.example.com" DNS entry should also have a low TTL.
- Upload license and 11.10.342 GHP to the standby VM via http://github-standby.example.com/setup.
- Add the backup site SSH key to authorized keys in manage http://github-standby.example.com/setup/settings.
- Run `ghe-standby github-standby.example.com` (WIP version here) from the backup site. This is essentially `ghe-maintenance -s && ghe-import-settings && sudo enterprise-configure` on the remote side. It puts the standby in maintenance mode and loads in settings from the last snapshot. The standby VM will stay in maintenance mode until it's activated.
- Should also run `ghe-import-ssh-host-keys` here but that changes the host key signature and will cause excessive SSH warnings and prompts. We can find a way to fit this into the `ghe-standby` script sanely.
- Perform an initial restore of the latest snapshot with `ghe-restore github-standby.example.com`.
- Schedule the backup run as `ghe-backup && ghe-restore github-standby.example.com`.
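Since the failover depends on both DNS entries having a low TTL, it may be worth checking them once the records exist. The sketch below parses the answer section printed by `dig +noall +answer`; the function names are hypothetical, and the 300 second default simply mirrors the "like 5 minutes" suggestion above.

```shell
#!/bin/sh
# Read `dig +noall +answer` output on stdin and print the TTL (second
# column) of the first A record.
ttl_from_answer() {
  awk '$4 == "A" { print $2; exit }'
}

# Succeed when the given TTL is non-empty and at or below the limit
# (default 300 seconds, i.e. 5 minutes).
ttl_is_low() {
  value="$1"
  limit="${2:-300}"
  [ -n "$value" ] && [ "$value" -le "$limit" ]
}

# Look up NAME and report whether its TTL is low enough for a quick failover.
check_ttl() {
  name="$1"
  found="$(dig +noall +answer "$name" A | ttl_from_answer)"
  if ttl_is_low "$found"; then
    echo "$name: TTL $found is fine"
  else
    echo "$name: TTL '${found:-unknown}' is too high or missing" >&2
    return 1
  fi
}
```

For example, `check_ttl github.example.com && check_ttl github-standby.example.com`. Note that a caching resolver reports the remaining TTL, which counts down between refreshes, so querying the authoritative name server gives the configured value.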
At this point, we have backups being taken and loaded into the standby on a regular basis. The process for recovery / failing over is:
- Put the primary instance in maintenance mode (if it's still up and available).
- Take the standby instance out of maintenance mode. I have a WIP `ghe-activate github-standby.example.com` script going here. This just takes the standby instance out of maintenance mode via `ghe-maintenance -u`.
- Check that https://github-standby.example.com is up and working.
- Point github.example.com DNS to `<standby-ip>`.
- Point github-standby.example.com DNS to the old primary if it should take over as the standby host. If it's borked, start at the beginning and set up a new standby VM.
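The first three failover steps could be scripted roughly like this. It is a sketch under assumptions, not the WIP `ghe-activate` script: `SSH_CMD`, `CURL_CMD`, `ADMIN_USER`, and `failover` are hypothetical names, and logging in over SSH as an `admin` user is an assumption about the appliance. The DNS changes are left as a manual step.

```shell
#!/bin/sh
# Failover sketch. SSH_CMD, CURL_CMD and ADMIN_USER are hypothetical knobs;
# the admin SSH login is an assumption about the appliance.
SSH_CMD="${SSH_CMD:-ssh}"
CURL_CMD="${CURL_CMD:-curl}"
ADMIN_USER="${ADMIN_USER:-admin}"

failover() {
  primary="$1"
  standby="$2"

  # Step 1: best effort only, since the primary may already be unreachable.
  "$SSH_CMD" "$ADMIN_USER@$primary" "ghe-maintenance -s" ||
    echo "warning: could not put $primary in maintenance mode" >&2

  # Step 2: take the standby out of maintenance mode.
  "$SSH_CMD" "$ADMIN_USER@$standby" "ghe-maintenance -u" || return 1

  # Step 3: confirm the standby answers over HTTPS before touching DNS.
  # -f turns HTTP error statuses into a non-zero exit code.
  "$CURL_CMD" -fsS -o /dev/null "https://$standby/" || return 1

  echo "standby is serving; now point $primary DNS at the standby IP"
}
```

It would be called as `failover github.example.com github-standby.example.com`. The HTTPS check may fail certificate validation if the standby presents a certificate issued for the primary's hostname; whether to relax that check is left to the operator.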
A couple things that are blocking continuous restores to a warm standby:
- `ghe-import-redis` backs up the current `redis.rdb` file to `/data/redis/redis.rdb.<timestamp>.bak` but never cleans them up. We'll exhaust disk space with these if restores happened every hour.
- `ghe-import-es-indices` has a similar problem. The current set of ES indexes are backed up to `/home/admin/elasticsearch-indices.<timestamp>` before the new indexes are put in place. This will fill up disk pretty quickly.
We can clean up the backups from the restore scripts but ideally the server side scripts would be retooled to keep only a few of the most recent backups here. Alternatively, it might be nice to be able to pass an option to the ghe-import-* scripts that tells them to avoid backing these things up altogether. It's useful when restoring to an existing VM but a warm standby will never have had data we'd want to keep around and these operations take time.
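As a stopgap, the "keep only a few of the most recent backups" idea could look something like the sketch below. `prune_old` is a hypothetical helper, not an existing script; it orders entries by modification time rather than by parsing the timestamp in the name (whose format isn't stated here), and it assumes the names contain no newlines.

```shell
#!/bin/sh
# prune_old DIR PATTERN KEEP
# Delete everything in DIR matching the glob PATTERN except the KEEP most
# recently modified entries. Works for files and directories alike.
prune_old() {
  dir="$1"
  pattern="$2"
  keep="$3"

  # $pattern is left unquoted on purpose so the shell expands the glob.
  # -t sorts newest first by modification time; -d lists directories
  # themselves rather than their contents. If nothing matches, ls prints
  # nothing on stdout and the loop never runs.
  ls -1td -- "$dir"/$pattern 2>/dev/null |
    tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -rf -- "$old"
    done
}
```

On the standby this might be run after each restore as `prune_old /data/redis 'redis.rdb.*.bak' 3` and `prune_old /home/admin 'elasticsearch-indices.*' 3`. Removing files the calling user does not own would still require elevated privileges.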
I would like that as well. DR is a major use case. Ideally, I don't want a 3rd VM to schedule this. The warm standby VM should be running ghe-backup and ghe-restore onto itself
> Ideally, I don't want a 3rd VM to schedule this. The warm standby VM should be running ghe-backup and ghe-restore onto itself
:+1: That's definitely the approach I think we'll be taking here in a future release.
If you go down that path, watch out for disk space. In my primary GHE, the / is the standard 75GB and /data/repositories is a much larger volume (say 500GB). The standby GHE has the same config.
If the 500GB is sufficiently used, the standby VM won't have enough space to perform ghe-backup. Please add a location that is writable by the admin user to store those temp backup files and that volume must be extendable.
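Until such a location exists, a pre-flight check along these lines could at least fail early instead of filling the disk mid-backup. It is a rough sketch: `used_kb`, `avail_kb`, and `fits` are hypothetical helpers, and comparing the used space on the repositories volume against free space at the destination is only an upper-bound estimate of what a backup needs.

```shell
#!/bin/sh
# Kilobytes in use on the filesystem that holds PATH
# (third column of POSIX-format `df -Pk` output).
used_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# Kilobytes still available on the filesystem that holds PATH
# (fourth column of `df -Pk` output).
avail_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# fits NEEDED_KB DEST
# Succeed when DEST's filesystem has at least NEEDED_KB kilobytes free.
fits() {
  [ "$(avail_kb "$2")" -ge "$1" ]
}
```

For example, `fits "$(used_kb /data/repositories)" /path/to/backup-data || echo "not enough room for a backup" >&2`, where `/path/to/backup-data` stands in for wherever the snapshots would be written.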
Few questions:
- I can delete those /home/admin/elasticsearch-indices.<timestamp> after the ghe-restore, but how can I remove /data/redis/redis.rdb.<timestamp>.bak since they belong to the root user? This alone is a blocker to using this feature.
- Step 2: why do we need ghe-activate? Can we just use ghe-maintenance directly?
- Why is ghe-import-settings needed before the backup has even taken place? It seems that is something we would do after the restore. What bothers me is that the hostname of the standby VM is set to github.example.com but the DNS is still pointing to the primary. That is inconsistent on the network, and I suspect it would render the standby unusable.
- Where can I find the ghe-import-settings and sudo enterprise-configure scripts?
Is this issue still relevant, now that GHE2 has become available?
I would still like to have a better story around recovery time from backup in a separate datacenter, including the ability to continuously restore each backup to an instance in standby mode. The pieces are there to do this today but right now our documentation and testing is limited to restoring cold with a new instance. Needs testing, ironing out any remaining issues, and documentation.
I'd also like to get something in place for @quocvu's suggestion in https://github.com/github/backup-utils/issues/33#issuecomment-55058199 of shipping backup-utils on the GHE appliance itself, being able to use it as the backup host, and having an out-of-the-box configuration that lets the backup host act as the standby instance. I think that can be split out from this issue, though.