pg_auto_failover icon indicating copy to clipboard operation
pg_auto_failover copied to clipboard

Make PG-Auto-Failover high-availability

Open d-maumary opened this issue 2 years ago • 13 comments

I love PG-Auto-Failover, but my colleagues & managers shout "Single point of failure!" when I show them the diagram. My suggestion is to run the monitor on a Postgres cluster, where the main & replica monitor each other, and the replica takes over if the primary fails. The worker nodes would use multi-host connection strings to connect to the primary monitor instance. The idea is similar to MariaDB's MaxScale (where there IS a single point of failure, as Maxscale is the proxy!) which has an active & a passive instance.

d-maumary avatar Jan 25 '22 05:01 d-maumary

@DimCitus while this is probably not on the roadmap for some time, if someone were to try to do this on their own, what steps would be needed to take here?

As far as I understand the architecture, this should be possible if we replaced the monitor node with another pg_auto_failover cluster which we then use as the monitor on the data nodes, right?

I don't expect this to work nicely with the pg_aucoctl command, but I am just thinking about this. Maybe some kind of "bring your own monitor" feature would be sufficient for people that want to do this kind of bullet proof HA.

s4ke avatar Mar 25 '22 10:03 s4ke

hi @d-maumary and @s4ke. The trouble with having the monitor an HA system all over again, is the following: who monitors the monitor? The only way forward that I can see would be to use a distributed consensus solution for the monitor then, ala Patroni, but it makes the deployment more complex, the failure cases more complex, the debugging of the system more complex, and fixing the cluster when something wrong happens even more complex.

Rather than try to make it never fail, we have implemented a solution that's easy to maintain. Some cases are not handled, but at least we don't pretend they are and then take the wrong decisions.

About trying things around, we have full support for running pg_auto_failover without a monitor in the existing code base. See https://github.com/citusdata/pg_auto_failover/blob/master/tests/test_monitor_disabled.py for the regression testing support of running without a monitor. You could write your own monitor and then use the local pg_autoctl commands pg_autoctl do fsm nodes set and pg_autoctl do fsm assign etc to control the operations.

DimCitus avatar Mar 25 '22 12:03 DimCitus

I 100% understand your points and think this is a good route to take for the space that pg_auto_failover is trying to take. It's simple and predictable.

I actually think that being able to pg_auto_failover without a monitor is a good pointer for anyone that wants to use the simplicity of the fsm, but wants to bring their own monitor implementation. I was not aware that this was the case, but this is a great design decision and actually validates our decision to go for pg_auto_failover all over again.

s4ke avatar Mar 25 '22 15:03 s4ke

Something that comes to mind from a marketing standpoint though is that it would be easier to "sell" the fact that the monitor is not HA if there was a documented path on how to recover from a lost monitor.

EDIT:

nvm. seems like I was out of date with my documentation reading:

https://pg-auto-failover.readthedocs.io/en/latest/operations.html#monitoring-pg-auto-failover-in-production

s4ke avatar Mar 25 '22 16:03 s4ke

This is also touched on in the FAQ: https://github.com/citusdata/pg_auto_failover/blob/master/docs/faq.rst#the-monitor-is-a-spof-in-pg_auto_failover-design-how-should-we-handle-that

xzilla avatar Mar 25 '22 20:03 xzilla

Just something that might be of interest for people that are concerned with this issue:

A simple way to get alerted on these issues where you might have to manually intervene, you could use an active healthcheck tool like uptime kuma and point it to a simple http api that returns a non OK status code if something in the server is not okay.

After this discussion I opted to write such a small service, you can find it here:

https://github.com/neuroforgede/pg_auto_failover_ansible/tree/master/tools/health_monitor

s4ke avatar Mar 28 '22 15:03 s4ke

A simple way to get alerted on these issues where you might have to manually intervene, you could use an active healthcheck tool like uptime kuma and point it to a simple http api that returns a non OK status code if something in the server is not okay.

Would be nice that pg_autoctl start such a simple HTTP web server with a very simple API, that could include that status code among other things. I had a PR prepared for that a while ago, in the context of disabled monitor. Might be nice to start the HTTP service idea again with this simple idea. Would you be up for opening a PR with that?

DimCitus avatar Mar 28 '22 16:03 DimCitus

@DimCitus this would be a service that is included in the pg_autoctl binary (=> so it would be required to be written in C) or could this be a separate binary (=> so python would be fine) that is managed by pg_autoctl?

I am asking, because currently I definitely will not have the time to get into the C codebase.

s4ke avatar Mar 28 '22 17:03 s4ke

Well I was thinking we should include it in the pg_autoctl binary and then yes write it in C. Now, I have also been thinking of using either Guile and then their Web Server integrated module, or maybe ECL.

DimCitus avatar Mar 28 '22 18:03 DimCitus

Actually, we could probably embed https://sqlite.org/althttpd/doc/trunk/althttpd.md as a sub-process to pg_auctoctl with a built-in setup and CGI scripts. We would mostly use the application/json mime type anyway, and most pg_autocyl commands have a --json switch already, meaning we might be able to just plug the existing commands to the CGI setup with an URL mapping.

When we have that, then later we can have a look at providing HTML output and more capabilities for the HTTPd setup, but we might also never have to actually do that.

DimCitus avatar Apr 04 '22 17:04 DimCitus

This article https://luppeng.wordpress.com/2020/11/11/building-resiliency-in-the-pg_auto_failover-monitor/ raises an interesting point which I think could be explored.

While I don't think everyone wants to run their postgres application inside Docker and/or Kubernetes because it adds complexity to the network stack, I think moving just the Monitor into a Docker Swarm or Kubernetes based Deployment together with a Volume attached to the container might solve this.

I am leaning towards using Docker Swarm due to its simplicity for a Setup like this:

Deploy a 3 node Docker Swarm with 3 managers using tiny vms and provision a cloud volume for usage with the monitor. Run a service with exactly 1 replica for the monitor. Bind mount the port to the outside and then use a virtual IP that points to any of the three managers and then let ingress routing do the rest.

The only issue with that is that you again create extra infrastructure that might fail. But it could help sell people the idea of pg_auto_failover.

Just a thought @DimCitus , does pg_autoctl support connecting to the host with a libpq style multihost connection string? If yes, we could leave out the virtual ip as well.

s4ke avatar Jun 03 '22 16:06 s4ke

If pgaf does not support connecting to multi host connection string monitors, one can always rely on HAProxy to select any viable IP for the monitor.

I feel like this is something that could be worth a try.

s4ke avatar Jun 03 '22 17:06 s4ke

While I don't think everyone wants to run their postgres application inside Docker and/or Kubernetes because it adds complexity to the network stack, I think moving just the Monitor into a Docker Swarm or Kubernetes based Deployment together with a Volume attached to the container might solve this.

This is touched in the docs at https://pg-auto-failover.readthedocs.io/en/master/faq.html#the-monitor-node-can-be-brought-up-again-without-data-having-been-lost. If you're using Kubernetes or another orchestration facility where you can attach a volume storage (in a safe way, fsync compliant and all) then when the monitor is lost, probably the volume isn't. Just re-schedule a Postgres node and attach the volume again. Done.

When the volume has been corrupted that's a different story altogether. PITR is needed, as always with Postgres. The docs hints a method to handle PITR for the monitor and at some point I though we could include that in pg_auto_failover itself, after having done the PITR for Postgres nodes parts. This work is not happening at the moment though.

DimCitus avatar Jun 07 '22 14:06 DimCitus