pg_auto_failover icon indicating copy to clipboard operation
pg_auto_failover copied to clipboard

Pg_auto_failover with HAProxy

Open adnanhamdussalam opened this issue 1 year ago • 9 comments

Hi,

Is it possible to configure the pg_auto_failover with haproxy like if the primary becomes standby and standby becomes primary the pg_auto_failover can communicate its status to haprxoy so haproxy can redirect the write request to primary and read request to standby ?

adnanhamdussalam avatar Nov 04 '24 12:11 adnanhamdussalam

As far as I know, it's not possible. Because you have to maintain the mointor nodes with another mointor.

wxmeng04 avatar Nov 13 '24 03:11 wxmeng04

Sorry, I did't unstand your question first. It's possible to use keepalived to use virtual ip as read-write interface.

wxmeng04 avatar Nov 13 '24 03:11 wxmeng04

Thank you for the update. You mean to use virtual ip as node ip and configure it in pg auto failover?

adnanhamdussalam avatar Nov 13 '24 06:11 adnanhamdussalam

You can include some script to detect the primary node and bind the virtual ip on the primary. Find below configuration file I'm using. Frankly I'm using keepalived with patroni right now, though I don't like python for this kind of application.

global_defs {
    router_id LVS_DEVEL
    vrrp_skip_check_adv_addr
    vrrp_garp_interval 0
    vrrp_gna_interval 0
}

vrrp_script check_postgresql {
    script "/usr/bin/su - postgres -c \"pg_controldata /var/lib/pgsql/17/data | grep 'in production'\""
    user root
    interval 2
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 52
    nopreempt
    priority 200
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123456
    }
    unicast_src_ip 192.168.0.7
    unicast_peer {
        192.168.0.8
    }
    virtual_ipaddress {
        192.168.0.20/27 dev eth0 label eth0:1
    }
    garp_master_delay 1
    garp_master_refresh 5

    track_script {
        check_postgresql
    }
}

wxmeng04 avatar Dec 02 '24 10:12 wxmeng04

I'm also experimenting with that. What I'm testing right now:

haproxy.cfg

global
    daemon
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

defaults
    mode                    tcp
    log                     global
    retries                 3
    timeout queue           1m
    timeout connect         1s
    timeout client          3600s
    timeout server          3600s
    timeout check           2s
    maxconn                 500

frontend stats
    mode http
    bind *:8182
    stats enable
    stats uri /stats
    stats refresh 10s
    stats auth admin:password
    stats uri  /haproxy?stats

frontend db
    bind *:5432
    mode tcp
    default_backend postgres

backend postgres
    mode tcp
    balance roundrobin
    option httpchk
    http-check send meth GET uri /health
    http-check expect status 200

    server db1 primary:5432 check port 5433
    server db2 standby:5432 check port 5433 backup

/us/local/bin/pgcheck

#!/bin/bash
#
# This script checks if a PostgreSQL server is healthy running on localhost.
# credits to several9s: https://severalnines.com/blog/postgresql-load-balancing-using-haproxy-keepalived/,
# https://www.percona.com/blog/postgresql-application-connection-failover-using-haproxy-with-xinetd/

export PGDATABASE='postgres'
export PGCONNECT_TIMEOUT=10

STANDBY_CHECK="SELECT pg_is_in_recovery()"
WRITABLE_CHECK="SHOW transaction_read_only"

# Check the status of PostgreSQL
READY=$(pg_isready 2> /dev/null)
if [ $? -ne 0 ]; then
    exit 1
fi

# check if in recovery mode (that means it is a 'standby')
STANDBY=$(psql -qt -c "$STANDBY_CHECK" 2>/dev/null)
if [ $? -ne 0 ]; then
    exit 1
elif echo $STANDBY | egrep -i "(t|true|on|1)" 2>/dev/null >/dev/null; then
    echo -e "Status: 202 Accepted\n\nstandby\n"
    exit 0
fi

# check if writable (then we consider it as a 'primary')
READONLY=$(psql -qt -c "$WRITABLE_CHECK" 2>/dev/null)
if [ $? -ne 0 ]; then
    exit 1
elif echo $READONLY | egrep -i "(f|false|off|0)" 2>/dev/null >/dev/null; then
    echo -e "Status: 200 OK\n\nprimary\n"
    exit 0
fi

exit 1
shell2http -no-index -port="$HEALTHCHECK_PORT" -500 -log=/dev/null -cgi /health pgcheck &

What it does: every 10s the url /health is requested. If it's primary it returns 200. For standby it returns HTTP Status 202 and for down 500. Only 200 Status is UP.

It does work OK. However, I think it's better to use multi host URL in the client.

dantio avatar Feb 02 '25 13:02 dantio

Hello @dantio , could you explain the reasons why you think it's better to use multi host URL in the client (libpq) performances issue ?

Some editors applications may not be able to use libpq multi host connexions. in these cases, a KE ou HAproxy solution could be fine.

I've experimented antother problem with psql && multi host connexions : If the first server of the connexion string is down (network issue, powerOff,...), then it will wait for PGCONNECT_TIMEOUT sec (default is more than 2mins, min is 1s)

# without setting PGCONNECT_TIMEOUT
$ time psql "postgres://vpostgres-test1-ope:8433,vpostgres-test2-ope:8433/postgres?target_session_attrs=read-write&sslmode=require" -U postgres -Aqtc "select  pg_read_file('/etc/hostname');"
vpostgres-test2-ope

real	2m9,511s
user	0m0,037s
sys	0m0,004s



# setting PGCONNECT_TIMEOUT to 1s
$ export  PGCONNECT_TIMEOUT=1
$ time psql "postgres://vpostgres-test1-ope:8433,vpostgres-test2-ope:8433/postgres?target_session_attrs=read-write&sslmode=require" -U postgres -Aqtc "select  pg_read_file('/etc/hostname');"
vpostgres-test2-ope

real	0m1,054s
user	0m0,036s
sys	0m0,004s


# Using connect_timeout in connexion string
$ time psql "postgres://vpostgres-test1-ope:8433,vpostgres-test2-ope:8433/postgres?target_session_attrs=read-write&sslmode=require&connect_timeout=1" -U postgres -Aqtc "select  pg_read_file('/etc/hostname');"
vpostgres-test2-ope

real	0m1,056s
user	0m0,031s
sys	0m0,014s

so each in this case (host1 offline), each connection takes 1s (min) to proceed... which is "not really" performant... That's why I look for KE / Hap /.. solutions...

PatrickBzh avatar Apr 09 '25 16:04 PatrickBzh

@PatrickBzh With multi host URL you don't have another layer between the App and the DB. Therefore, I was thinking that this solution is better, since lightweight. Since the DB switchover/failover should only seldom happen, 1sec is OK I think...

However, our Clients do not support multi Host URL and we actually never used is so far. We have adjusted the haproxy a litte:

defaults
    mode                    tcp
    log                     global
#    retries                 3 # retry forever
    timeout queue           1m
    timeout connect         4s
    timeout client          10m
    timeout server          10m
    timeout check           2s
    maxconn                 2000

resolvers docker
    nameserver dns1 127.0.0.11:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry   1s
    hold other      10s
    hold refused    10s
    hold nx         10s
    hold timeout    10s
    hold valid      10s
    hold obsolete   10s

frontend stats
    mode http
    bind *:8182
    stats enable
    stats refresh 10s
    stats auth haproxy:password
    stats uri  /haproxy?stats

frontend db
    bind *:5432
    mode tcp
    default_backend postgres

backend postgres
    mode tcp
    balance first
    option httpchk GET /health
    http-check expect status 200
    # 5433 is the health check port
    server master    db_master:5432     check inter 3s check resolvers docker init-addr libc,none port 5433 on-marked-down shutdown-sessions
    server backup db_standby:5432 check inter 3s check resolvers docker init-addr libc,none port 5433 on-marked-down shutdown-sessions

We are happy with this config. Switchover/Failover happens immediately in our Clients (NestJS+Slonik)

dantio avatar Apr 10 '25 13:04 dantio

We have been running his config in production for a while now:

https://github.com/neuroforgede/pg_auto_failover_ansible/wiki/HAProxy

s4ke avatar Apr 10 '25 16:04 s4ke

@s4ke nice!

I have really found some gems in NeuroForge repositories. Thank you for contributing to OS👍🏼

dantio avatar Apr 10 '25 16:04 dantio