openlibrary

Create replicas of key systems to enable automatic failover during downtime

Open cclauss opened this issue 2 years ago • 2 comments

Epic / Tracking Issue for a significant work effort.

I am always frustrated when our service is down for our users.

Describe the problem that you'd like solved

Our platform is increasingly mission-critical for users around the globe, so let's leverage our Docker-based architecture to implement automatic failover of key services.

The following list is in the recommended order of implementation:

Services currently running on multiple servers

  • [x] memcache service failover between hosts ol-mem0, ol-mem1, ol-mem2 running memcached on bare metal
  • [ ] database failover between hosts ol-db1 (primary) and ol-db2 (backup) running Postgres on bare metal
  • [ ] web service failover between hosts ol-web1 and ol-web2 running Docker container openlibrary-web-1
  • [ ] Solr service failover between hosts ol-solr0 and ol-solr1 running Docker container openlibrary_solr_1

Services currently running on a single server

  • [ ] cover images service failover between Docker containers openlibrary-covers-1 and openlibrary-covers-2
  • [ ] home services on ol-home0, which runs seven different Docker containers
  • [ ] www services on ol-www0, which runs the haproxy and nginx Docker containers

It will be important to distinguish services that will operate in primary/backup mode (like the database) from those that will operate in load-sharing / parallel mode (like memcached). We will need to document and test the failover conditions and constraints for each. For example, failure of the primary database server might put the site into read-only mode on the backup server.
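To make the distinction concrete, here is a minimal sketch of how the two modes might choose which hosts receive traffic. The names `Mode`, `Service`, and `usable_hosts` are hypothetical and not part of the Open Library codebase, and the rule "running on the backup means read-only" is only the example scenario described above, not a settled decision.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    PRIMARY_BACKUP = "primary/backup"
    LOAD_SHARING = "load-sharing"


@dataclass(frozen=True)
class Service:
    name: str
    mode: Mode
    hosts: tuple  # for PRIMARY_BACKUP, the first host is the primary


def usable_hosts(service, healthy):
    """Return (hosts that should receive traffic, read_only flag)."""
    up = [host for host in service.hosts if host in healthy]
    if service.mode is Mode.LOAD_SHARING:
        # Every healthy host shares the load; losing one only reduces capacity.
        return up, False
    if not up:
        return [], False
    # Primary/backup: exactly one host is active at a time.
    active = up[0]
    # Assumption: when the backup is active, the site runs read-only.
    read_only = active != service.hosts[0]
    return [active], read_only


DATABASE = Service("database", Mode.PRIMARY_BACKUP, ("ol-db1", "ol-db2"))
MEMCACHE = Service("memcache", Mode.LOAD_SHARING, ("ol-mem0", "ol-mem1", "ol-mem2"))
```

With ol-db1 down, `usable_hosts(DATABASE, {"ol-db2"})` selects only the backup and flags read-only, while losing one memcache host simply shrinks the pool.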

Proposal & Constraints

Document and implement a failover approach for each of the services listed above, then use Chaos Monkey-style testing to verify service resilience in the face of unplanned software, operating system, and hardware failures.
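As a rough illustration of what such testing could look like, the sketch below stops one randomly chosen host per failover pair, checks whether the service still answers, and always restores the host afterwards. Everything here is an assumption for illustration: `chaos_round` is a hypothetical helper, and the stop, start, and health-check callables are in-memory stand-ins for whatever real mechanism (for example, stopping a container or querying a health endpoint) would be used.

```python
import random


def chaos_round(pairs, stop_host, start_host, is_service_up, rng=random):
    """Stop one random host per failover pair and record whether the service survived."""
    results = {}
    for service, hosts in pairs.items():
        victim = rng.choice(list(hosts))
        stop_host(victim)
        try:
            results[service] = (victim, is_service_up(service))
        finally:
            start_host(victim)  # always restore the host, even if the check raises
    return results


# In-memory stand-ins so the sketch runs without touching real servers.
PAIRS = {
    "web": ("ol-web1", "ol-web2"),
    "solr": ("ol-solr0", "ol-solr1"),
}
down = set()


def fake_is_up(service):
    # A service counts as up while at least one of its hosts is not stopped.
    return any(host not in down for host in PAIRS[service])


report = chaos_round(PAIRS, down.add, down.discard, fake_is_up, rng=random.Random(0))
```

Each entry in `report` maps a service to the host that was stopped and whether the service stayed up, which is the kind of record the documented failover conditions could be checked against.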

The hosts in a failover pair must be placed on different virtual machines to ensure resilience to hardware failures. This should also simplify the process of planned downtime and hardware migration while also distributing workloads among virtual machines.
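This placement constraint could be checked mechanically. The sketch below assumes a hypothetical inventory mapping each host to the virtual machine it runs on; the function name `colocated_pairs` and the labels `vm-a` and `vm-b` are invented for illustration and do not describe the real deployment.

```python
def colocated_pairs(pairs, placement):
    """Return the services whose failover hosts share a placement.

    `placement` maps host name -> placement label. A host missing from
    the inventory raises KeyError so the gap is noticed, not ignored.
    """
    violations = []
    for service, hosts in pairs.items():
        labels = [placement[host] for host in hosts]
        if len(set(labels)) < len(labels):
            violations.append(service)
    return violations


# Hypothetical inventory: both database hosts share a placement, so
# only "database" should be reported.
example_pairs = {
    "database": ("ol-db1", "ol-db2"),
    "web": ("ol-web1", "ol-web2"),
}
example_placement = {
    "ol-db1": "vm-a",
    "ol-db2": "vm-a",
    "ol-web1": "vm-a",
    "ol-web2": "vm-b",
}
violations = colocated_pairs(example_pairs, example_placement)
```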

Many of these services might require a two-step migration to failover. The minimum-viable-failover phase will prove basic service failover while documenting but not solving all corner cases. The full failover phase will improve automation and solve all documented corner cases.

Tracking issue

database failover:

  • [ ] As we set up failover on our servers, determine whether Open Library can continue to operate when ol-db1 goes down (i.e., automatically switch to ol-db2 in read-only mode)
  • [ ] Configure ol-db1 & ol-db2 networking so that if/when ol-db1 goes down, Open Library is able to gracefully switch to ol-db2 in read-only mode
    • [ ] openlibrary.yml config specifies that ol-db1 is our database so what change would enable failover?
    • [ ] Q: Is it possible (in networking land) for ol-db2 to take over the IP or hostname of ol-db1 in the event of an outage?
    • [ ] If not, how are these failovers typically done?
    • [ ] Q: Should infobase be modified to implement failover when establishing each database session or is there a better approach?
    • [ ] Are there significant differences between software vs. hardware outages?
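One possible shape for the session-level option raised above is to try each database host in order when a session is established. This is only a sketch under stated assumptions: `connect_with_failover` is a hypothetical helper, the `connect` callable stands in for whatever infobase actually uses to open a connection, and the fake below exists only so the example runs without a database.

```python
def connect_with_failover(hosts, connect, errors=(OSError,)):
    """Try each host in order; return (host, connection) for the first success."""
    last_error = None
    for host in hosts:
        try:
            return host, connect(host)
        except errors as exc:
            last_error = exc  # remember the failure and try the next host
    raise ConnectionError(f"no database host reachable: {list(hosts)}") from last_error


# Stand-in for a real driver call; here ol-db1 is simulated as being down.
def fake_connect(host):
    if host == "ol-db1":
        raise OSError("connection refused")
    return f"connection to {host}"


host, conn = connect_with_failover(["ol-db1", "ol-db2"], fake_connect)
```

The caller would still need to treat a connection to ol-db2 as read-only. As an alternative to application-level retry, libpq (PostgreSQL 10 and later) accepts a comma-separated host list and a `target_session_attrs` parameter in the connection string; whether the database layer used here passes those through has not been verified.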

database upgrade:

  • [x] #5686
  • [ ] #5675

Stakeholders

@abezella @mekarpeles @cdrini @scottbarnes

cclauss · Jul 07 '23 09:07

It is important that these multiple servers/VMs run in different availability zones without shared location, power, upstream infrastructure, etc. Currently one of the failure modes is "Hey, we're going to shut off the power in our (only) data center for hours." With cloud computing services widely available, it's cheap to build local/cloud hybrid solutions without having to invest in geographically dispersed data centers for disaster recovery.

tfmorris · Jul 07 '23 18:07

Assignees removed automatically after 14 days.

github-actions[bot] · Jan 25 '24 08:01