gatus High availability mode

Open TwiN opened this issue 2 years ago • 3 comments

This feature would allow more than one replica of Gatus with the exact same configuration to coexist by leveraging leader election through the new postgres storage.type.

Programmatically, this is how I envision it to work:

1. The first instance of Gatus, henceforth G1, starts.
2. G1 tries to acquire the lock by querying the new instance table in the Postgres database.
3. Because the row specifying whether an instance has claimed the role of leader does not exist yet, G1 creates a row with the column label set to default, the role set to LEADER and the last_heartbeat set to CURRENT_TIMESTAMP.
4. G1 is now the leader, therefore it begins monitoring the configured services.
5. Every minute, G1 updates the last_heartbeat timestamp in the Postgres database.
6. A second instance of Gatus, henceforth G2, starts.
7. G2 tries to acquire the writer lock by querying the instance table in the Postgres database for the label default and the role LEADER.
8. G2 fails to acquire the lock, because another instance has already acquired it and the last_heartbeat timestamp is within the past 5 minutes. This 5-minute window shall be defined as the time until reelection.
9. G2 retries acquiring the writer lock every 2 minutes.
10. Now, let's assume that G1 runs into an issue and crashes.
11. G1 restarts and tries to acquire the lock but, as documented by step 8, it fails.
12. 5 minutes go by and the time for reelection has come, after which either G1 or G2 will grab the lock.

During this entire time, both G1 and G2 can read from the database, and therefore handle HTTP requests. The only restriction is that no more than one leader for one label can write at any given time.
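
As a rough sketch of that restriction, assuming each instance simply remembers whether it currently holds the lock: the instance and results types and the read and write methods below are hypothetical, and the map stands in for the shared Postgres storage.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotLeader = errors.New("only the leader may write results")

// instance is a hypothetical stand-in for one Gatus replica.
type instance struct {
	name     string
	isLeader bool
}

// results stands in for the shared storage: service name -> last known status.
type results map[string]string

// read is allowed on every replica, so any of them can serve HTTP requests.
func (i instance) read(db results, service string) (string, bool) {
	status, ok := db[service]
	return status, ok
}

// write is only allowed on the replica that currently holds the leader lock.
func (i instance) write(db results, service, status string) error {
	if !i.isLeader {
		return errNotLeader
	}
	db[service] = status
	return nil
}

func main() {
	db := results{}
	g1 := instance{name: "G1", isLeader: true}
	g2 := instance{name: "G2"}

	fmt.Println("G1 write:", g1.write(db, "frontend", "healthy"))
	fmt.Println("G2 write:", g2.write(db, "frontend", "unhealthy"))

	status, _ := g2.read(db, "frontend")
	fmt.Println("G2 read:", status)
}
```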

distributed:
  mode: HA
  label: default

The parameter distributed.label is optional, and will default to the value default.

Why do we need a label?

This will be needed for #64. Basically, let's say you wanted to deploy Gatus in 3 isolated environments which all have access to the Postgres database; let's call them alpha, bravo and charlie. Of course, each environment has its own set of services to monitor.

You'd use the label to differentiate these environments and allow one leader per environment to push its data to the database, all while allowing each separate environment to be highly available.

Requirements:

storage.type must be set to postgres
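
A minimal sketch of how the label default and this requirement could be checked at startup. The distributedConfig type and the validateAndSetDefaults method are hypothetical and not part of Gatus; only the parameter names and values (distributed.mode, distributed.label, storage.type, HA, default, postgres) come from the proposal.

```go
package main

import (
	"errors"
	"fmt"
)

const defaultLabel = "default"

// distributedConfig mirrors the proposed "distributed" configuration section.
type distributedConfig struct {
	Mode  string // e.g. "HA"
	Label string // optional, falls back to "default"
}

// validateAndSetDefaults fills in the default label and rejects HA mode
// when the configured storage type is anything other than postgres.
func (c *distributedConfig) validateAndSetDefaults(storageType string) error {
	if c.Label == "" {
		c.Label = defaultLabel
	}
	if c.Mode == "HA" && storageType != "postgres" {
		return errors.New("distributed.mode HA requires storage.type to be postgres")
	}
	return nil
}

func main() {
	cfg := distributedConfig{Mode: "HA"}
	fmt.Println(cfg.validateAndSetDefaults("postgres"), cfg.Label)

	bad := distributedConfig{Mode: "HA", Label: "alpha"}
	fmt.Println(bad.validateAndSetDefaults("memory"))
}
```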

Sep 17 '21 01:09 TwiN

Could you make HA available without using a database?

If we know the endpoints (IPs) of all Gatus instances in advance, we could simply list them in the configuration and they could elect a leader by talking to each other. One well-known algorithm for doing that is Raft: https://raft.github.io/

Sep 30 '21 08:09 guillomep

> Could you make HA available without using a database?
>
> If we know the endpoints (IPs) of all Gatus instances in advance, we could simply list them in the configuration and they could elect a leader by talking to each other. One well-known algorithm for doing that is Raft: https://raft.github.io/

I think an easier/quicker path to HA might be to model it after Prometheus and leverage Alertmanager to de-dupe alerts.

I've only taken a brief look so far but I think the existing custom notification will work with Alertmanager so long as the notification limiter is commented out.

Apr 10 '22 06:04 BrianInAz

Hi there, I think this issue has lost a bit of traction. Is there any status on this topic other than what is described in this issue?

Jan 12 '24 09:01 beatkind
