High availability mode
This feature would allow more than one replica of Gatus with the exact same configuration to coexist by leveraging leader election through the new `postgres` `storage.type`.
Programmatically, this is how I envision it to work:
1. First instance of Gatus, henceforth G1, starts.
2. G1 tries to acquire the lock by querying the new `instance` table in the Postgres database.
3. Because the row specifying whether an instance has claimed the role of leader does not exist yet, G1 creates a row with the column `label` set to `default`, the `role` set to `LEADER` and the `last_heartbeat` set to `CURRENT_TIMESTAMP`.
4. G1 is now the leader, therefore it begins monitoring the configured services.
5. Every minute, G1 updates the `last_heartbeat` timestamp in the Postgres database.
6. Second instance of Gatus, henceforth G2, starts.
7. G2 tries to acquire the writer lock by querying the `instance` table in the Postgres database for the label `default` and the role `LEADER`.
8. G2 fails to acquire the lock, because another instance has already acquired it and the `last_heartbeat` timestamp is within the past 5 minutes. These 5 minutes shall be defined as the time until reelection.
9. G2 retries to acquire the writer lock every 2 minutes.
10. Now, let's assume that G1 runs into an issue and crashes.
11. G1 restarts and tries to acquire the lock, but as documented in step 8, it fails.
12. 5 minutes go by and the time for reelection comes, after which either G1 or G2 will grab the lock.
During this entire time, both G1 and G2 can read from the database, and therefore handle HTTP requests. The only restriction is that no more than one leader for one label can write at any given time.
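The acquisition logic described above can be sketched as a small simulation. This is a Python sketch with an in-memory dictionary standing in for the proposed Postgres `instance` table; the names `InstanceTable`, `try_acquire` and `heartbeat` are illustrative, not part of Gatus:

```python
REELECTION_TIMEOUT = 5 * 60  # seconds a heartbeat may be stale before reelection

class InstanceTable:
    """In-memory stand-in for the proposed Postgres `instance` table."""

    def __init__(self):
        # label -> {"role": "LEADER", "instance": name, "last_heartbeat": ts}
        self.rows = {}

    def try_acquire(self, label, instance, now):
        """Claim the LEADER row for `label` if it is absent or stale."""
        row = self.rows.get(label)
        if row is None or now - row["last_heartbeat"] > REELECTION_TIMEOUT:
            self.rows[label] = {"role": "LEADER", "instance": instance,
                                "last_heartbeat": now}
            return True
        return False

    def heartbeat(self, label, instance, now):
        """Leader-only: refresh `last_heartbeat`, done once a minute."""
        row = self.rows.get(label)
        if row is not None and row["instance"] == instance:
            row["last_heartbeat"] = now

# Replaying the scenario above (times in seconds):
table = InstanceTable()
assert table.try_acquire("default", "G1", now=0)        # G1 becomes leader
table.heartbeat("default", "G1", now=60)                # minutely heartbeat
assert not table.try_acquire("default", "G2", now=120)  # heartbeat fresh: G2 fails
# G1 crashes and stops heartbeating; once its last heartbeat is more than
# 5 minutes old, the next acquisition attempt wins the reelection.
assert table.try_acquire("default", "G2", now=60 + REELECTION_TIMEOUT + 1)
```

In the real implementation the claim-if-absent-or-stale step would of course have to be a single atomic statement (e.g. an upsert) so that two instances cannot both win the same election.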
```yaml
distributed:
  mode: HA
  label: default
```
The parameter `distributed.label` is optional, and will default to the value `default`.
Why do we need a label?
This will be needed for #64 -- basically, let's say you wanted to deploy Gatus in 3 isolated environments which all have access to the Postgres database; let's call them `alpha`, `bravo` and `charlie`. Of course, each environment has its own set of services to monitor.
You'd use the `label` to differentiate these environments and allow one leader per environment to push its data to the database, all while allowing each separate environment to be highly available.
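For instance, every replica in the `alpha` environment would share something like the following (hypothetical values, reusing the `distributed` block proposed above):

```yaml
distributed:
  mode: HA
  label: alpha   # replicas in bravo and charlie use their own label
```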
Requirements:
- `storage.type` must be set to `postgres`
Could you make HA available without the use of a database?
If we know in advance the endpoints (IPs) of all Gatus instances, we could simply list them in the configuration and they could elect a leader by talking to each other. One well-known algorithm to do that is Raft: https://raft.github.io/
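Purely as an illustration of that suggestion (none of these keys exist in Gatus), a database-free setup could look like a static peer list that the instances use to run Raft among themselves:

```yaml
distributed:
  mode: HA
  peers:           # hypothetical: addresses of all replicas, known in advance
    - 10.0.1.10:8080
    - 10.0.1.11:8080
    - 10.0.1.12:8080
```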
I think an easier/quicker path to HA might be to model it after Prometheus and leverage Alertmanager to de-dupe alerts.
I've only taken a brief look so far, but I think the existing custom notification will work with Alertmanager so long as the notification limiter is commented out.
Hi there, I think this issue has lost a bit of traction. Is there any other status on this topic than what is described in this issue?