Create replicas of key systems to enable automatic failover during downtime
Epic / Tracking Issue for a significant work effort.
Describe the problem that you'd like solved

I am always frustrated when our service is down for our users. Our platform is increasingly mission-critical for users around the globe, so let's leverage our Docker-based architecture to implement automatic failover of key services.
The following list is in the recommended order of implementation:
Services currently running on multiple servers
- [x] memcache service failover between hosts `ol-mem0`, `ol-mem1`, `ol-mem2` running `memcached` on bare metal
- [ ] database failover between hosts `ol-db1` (primary) and `ol-db2` (backup) running Postgres on bare metal
- [ ] web service failover between hosts `ol-web1` and `ol-web2` running Docker container `openlibrary-web-1`
- [ ] Solr service failover between hosts `ol-solr0` and `ol-solr1` running Docker container `openlibrary_solr_1`
Services currently running on a single server
- [ ] cover images service failover between Docker containers `openlibrary-covers-1` and `openlibrary-covers-2`
- [ ] home services running on `ol-home0` (seven different Docker containers)
- [ ] www services running on `ol-www0` (`haproxy` and `nginx` Docker containers)
It will be important to distinguish services that will operate in primary/backup mode (like the database) from those that will operate in load-sharing / parallel mode (like memcache). We will need to document and test the failover conditions and constraints. For example, failure of the primary database server might put the site into read-only mode on the backup server.
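The two modes differ in how a client picks a host when something is down. Here is a minimal Python sketch of the distinction, using host names from this issue; the injected `is_up` health probe is an assumption for illustration, not existing Open Library code:

```python
from hashlib import sha1

def pick_parallel(hosts, key, is_up):
    """Load-sharing mode (memcache-style): every healthy host serves
    traffic, and keys are spread by hash, so losing one host only
    loses that host's share of cached keys."""
    healthy = [h for h in hosts if is_up(h)]
    if not healthy:
        raise RuntimeError("no healthy cache hosts")
    idx = int(sha1(key.encode()).hexdigest(), 16) % len(healthy)
    return healthy[idx]

def pick_primary_backup(hosts, is_up):
    """Primary/backup mode (database-style): always prefer the first
    healthy host in priority order; falling through to a backup may
    carry constraints such as read-only access."""
    for host in hosts:
        if is_up(host):
            return host
    raise RuntimeError("no database host available")
```

The documentation task above is largely about writing down, per service, which of these two policies applies and what constraints (like read-only mode) apply when the non-preferred host is serving.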
Proposal & Constraints
Document and implement a failover approach for each of the services listed above and then use chaos monkey-like testing to ensure service resilience in the face of unplanned software, operating system, and hardware failure.
The hosts in a failover pair must be placed on different physical machines (different KVM hosts) to ensure resilience to hardware failures. This should also simplify the process of planned downtime and hardware migration while distributing workloads among virtual machines.
Many of these services might require a two-step migration to failover. The minimum-viable-failover phase will prove basic service failover while documenting but not solving all corner cases. The full failover phase will improve automation and solve all documented corner cases.
Tracking issue
- [x] Document which of our hosts are on which Kernel-based Virtual Machine hosts and which hosts have SSDs, etc.
database failover:
- [ ] Something we should learn as we set up "fail-over" on our servers is whether Open Library can continue to operate when `ol-db1` goes down (i.e. auto switch to `ol-db2` in read-only mode)
- [ ] Configure `ol-db1` & `ol-db2` networking so that if/when `ol-db1` goes down, Open Library is able to gracefully switch to `ol-db2` in read-only mode
- [ ] `openlibrary.yml` config specifies that `ol-db1` is our database, so what change would enable failover?
- [ ] Q: Is it possible (in networking land) for `ol-db2` to take over the IP or hostname of `ol-db1` in the event of an outage?
- [ ] If not, how are these failovers typically done?
- [ ] Q: Should infobase be modified to implement failover when establishing each database session, or is there a better approach?
- [ ] Are there significant differences between software vs. hardware outages?
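If infobase were to handle failover when establishing each database session, the core logic might look like the sketch below. This is a hypothetical illustration, not existing infobase code: `connect` is an injected callable (e.g. a Postgres driver's connect function), and the read-only flag mirrors this issue's plan to run the site read-only on `ol-db2`:

```python
def connect_with_failover(connect, hosts=("ol-db1", "ol-db2")):
    """Try hosts in priority order: primary first, then backup.
    Returns (connection, read_only); any session established on a
    non-primary host is flagged read-only, since the backup cannot
    safely accept writes."""
    last_err = None
    for i, host in enumerate(hosts):
        try:
            return connect(host), i > 0  # i > 0 means we fell to a backup
        except OSError as err:
            last_err = err               # remember why the host failed
    raise RuntimeError("all database hosts are down") from last_err
```

The alternative approaches raised above (a floating IP/hostname taken over by `ol-db2`, or a change to what `openlibrary.yml` points at) would keep this logic out of the application entirely, which is part of what the questions in this list need to settle.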
database upgrade:
- [x] #5686
- [ ] #5675
Stakeholders
@abezella @mekarpeles @cdrini @scottbarnes
It is important that these multiple servers/VMs run in different availability zones without shared location, power, upstream infrastructure, etc. Currently one of the failure modes is "Hey, we're going to shut off the power in our (only) data center for hours." With the ubiquitous availability of cloud computing services, it's cheap to build local/cloud hybrid solutions without having to invest in geographically dispersed data centers for disaster recovery.