Temporal
Database Refactor - Better HA + Fault Tolerance
Overview
In light of the temporary impact on IPFS HTTP API directory uploads caused by a database sync issue, we need to refactor the way our database tooling works. We need better HA and fault tolerance so that if the incident repeats, we can automatically fail over to a working database.
Our current database system consists of three nodes in logical replication. This lets us perform a manual fail-over in the event of an incident and ensures we have replicated copies of our databases, in addition to hourly backups. However, this process isn't as smooth as it could be.
While this endeavour falls on me to accomplish, it has the "help wanted" label, as this is an area of database administration I'm not familiar with, and I would welcome community input.
End Goals
- Multi-master replication
- Automatic fail-over
- Load balanced requests
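The last two goals, automatic fail-over and load-balanced requests, can be illustrated on the client side independently of whichever cluster solution is picked. Below is a minimal Go sketch, assuming a fixed list of database nodes whose health is reported by some external checker. The `Balancer` type, `SetHealthy`, and the hostnames are hypothetical and not part of Temporal. Requests are spread round-robin across healthy nodes, and a node marked unhealthy is simply skipped, which gives a basic form of automatic fail-over.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNoHealthyNodes is returned when every node is marked down.
var ErrNoHealthyNodes = errors.New("no healthy database nodes")

// node is a hypothetical database endpoint plus its last known health.
type node struct {
	addr    string
	healthy bool
}

// Balancer round-robins requests across healthy nodes (hypothetical helper).
type Balancer struct {
	mu    sync.Mutex
	nodes []*node
	next  int // index to start the next search from
}

// NewBalancer assumes all nodes start out healthy.
func NewBalancer(addrs ...string) *Balancer {
	b := &Balancer{}
	for _, a := range addrs {
		b.nodes = append(b.nodes, &node{addr: a, healthy: true})
	}
	return b
}

// SetHealthy would be driven by a periodic health check in practice.
func (b *Balancer) SetHealthy(addr string, healthy bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, n := range b.nodes {
		if n.addr == addr {
			n.healthy = healthy
		}
	}
}

// Pick returns the next healthy node in round-robin order,
// skipping unhealthy nodes so traffic fails over automatically.
func (b *Balancer) Pick() (string, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for i := 0; i < len(b.nodes); i++ {
		idx := (b.next + i) % len(b.nodes)
		if b.nodes[idx].healthy {
			b.next = (idx + 1) % len(b.nodes)
			return b.nodes[idx].addr, nil
		}
	}
	return "", ErrNoHealthyNodes
}

func main() {
	// Hypothetical three-node setup, mirroring the current node count.
	b := NewBalancer("db1:5432", "db2:5432", "db3:5432")

	addr, _ := b.Pick()
	fmt.Println(addr) // db1:5432

	b.SetHealthy("db2:5432", false) // simulate a failed node
	addr, _ = b.Pick()
	fmt.Println(addr) // db3:5432 (db2 is skipped)
	addr, _ = b.Pick()
	fmt.Println(addr) // db1:5432
}
```

In a real deployment the health flag would be flipped by periodic probes (for example a trivial `SELECT 1`). Note that this only covers request routing; multi-master replication still has to come from the database layer itself.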
Research
Tracks research notes and related findings.
Possible Implementations
Will contain analysis, pros, cons, etc. of the available solutions.
Standby Databases
Clusters
DRBD (Distributed Replicated Block Device)
- Corosync + Pacemaker + DRBD
Pgpool II
Citus CE
Postgres-XL
CockroachDB
Bucardo
Links
- https://severalnines.com/database-blog/top-pg-clustering-high-availability-ha-solutions-postgresql
- https://www.postgresql.org/docs/10/high-availability.html
- https://wiki.postgresql.org/wiki/Main_Page
Going to use CockroachDB, as it appears to be the easiest to maintain. The other solutions seem to require a pretty solid understanding of databases and general DBA work, which I definitely do not have; I dropped out of a college program to become a DBA, so yeah :joy: