[bitnami/postgresql-ha] ERROR: cannot execute INSERT in a read-only transaction
Name and Version
bitnami/postgresql-ha 6.5.5
What architecture are you using?
None
What steps will reproduce the bug?
- Deploy PostgreSQL with the HA architecture (Replication Manager (repmgr) and Pgpool) using the Bitnami Helm chart, with PostgreSQL replicas=2 and Pgpool replicas=2.
- A C++ application is connected to the DB via Pgpool using the libpq library and executes SELECT, INSERT and UPDATE queries.
- Occasionally, when the master PostgreSQL node switches to slave and the slave node to master due to a restart or any other reason, Pgpool may continue to send all queries to the previous master, which is now acting as a slave. This results in the following error:
ERROR: cannot execute INSERT in a read-only transaction
This error occurs because Pgpool attempts to execute INSERT and UPDATE queries on a PostgreSQL instance that is in a read-only state.
Are you using any custom parameters or values?
----------------------------
pgPool-II configuration file
----------------------------
backend_clustering_mode = 'streaming_replication'
listen_addresses = '*'
port = '5432'
socket_dir = '/opt/build/pgpool/tmp'
reserved_connections = 0
pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/opt/build/pgpool/tmp'
listen_backlog_multiplier = 2
serialize_accept = off
backend_hostname0 = 'host1'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'server0'
enable_pool_hba = 'on'
authentication_timeout = '30'
allow_clear_text_frontend_auth = 'off'
ssl = 'on'
ssl_key = '/opt/build/pgpool/certs/tls.key'
ssl_cert = '/opt/build/pgpool/certs/tls.crt'
ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
ssl_prefer_server_ciphers = off
ssl_ecdh_curve = 'prime256v1'
ssl_dh_params_file = ''
num_init_children = '200'
max_pool = '15'
child_life_time = 5min
child_max_connections = 0
connection_life_time = 0
client_idle_limit = 0
log_destination = 'stderr'
log_connections = 'off'
log_disconnections = off
log_hostname = 'on'
log_statement = off
log_per_node_statement = 'off'
log_client_messages = off
log_standby_delay = 'if_over_threshold'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
pid_file_name = '/opt/build/pgpool/tmp/pgpool.pid'
logdir = '/opt/build/pgpool/logs'
connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'
replicate_select = off
insert_lock = off
lobj_lock_table = ''
replication_stop_on_mismatch = off
failover_if_affected_tuples_mismatch = off
load_balance_mode = 'off'
ignore_leading_white_space = on
read_only_function_list = ''
write_function_list = ''
primary_routing_query_pattern_list = ''
database_redirect_preference_list = ''
app_name_redirect_preference_list = ''
allow_sql_comments = off
disable_load_balance_on_write = 'transaction'
dml_adaptive_object_relationship_list= ''
statement_level_load_balance = 'off'
sr_check_period = '30'
sr_check_user = 'repmgr'
sr_check_database = 'postgres'
delay_threshold = 10000000
follow_primary_command = ''
health_check_period = '30'
health_check_timeout = '10'
health_check_user = 'repmgr'
health_check_database = ''
health_check_max_retries = '5'
health_check_retry_delay = '5'
connect_timeout = 10000
failover_command = 'echo ">>> Failover - that will initialize new primary node search!"'
failback_command = ''
failover_on_backend_error = 'off'
detach_false_primary = off
search_primary_node_timeout = '0'
recovery_user = 'nobody'
recovery_password = ''
recovery_1st_stage_command = ''
recovery_2nd_stage_command = ''
recovery_timeout = 90
client_idle_limit_in_recovery = 0
auto_failback = off
auto_failback_interval = 1min
use_watchdog = off
trusted_servers = ''
ping_path = '/bin'
hostname0 = ''
wd_port0 = 9000
pgpool_port0 = 9999
wd_priority = 1
wd_authkey = ''
wd_ipc_socket_dir = '/tmp'
delegate_IP = ''
if_cmd_path = '/sbin'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev eth0'
arping_path = '/usr/sbin'
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'
clear_memqcache_on_escalation = on
wd_escalation_command = ''
wd_de_escalation_command = ''
failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
enable_consensus_with_half_votes = off
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
heartbeat_hostname0 = ''
heartbeat_port0 = 9694
heartbeat_device0 = ''
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
wd_life_point = 3
wd_lifecheck_query = 'SELECT 1'
wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''
relcache_expire = 0
relcache_size = 256
check_temp_table = catalog
check_unlogged_table = on
enable_shared_relcache = on
memory_cache_enabled = off
memqcache_method = 'shmem'
memqcache_memcached_host = 'localhost'
memqcache_memcached_port = 11211
memqcache_total_size = 64MB
memqcache_max_num_cache = 1000000
memqcache_expire = 0
memqcache_auto_cache_invalidation = on
memqcache_maxcache = 400kB
memqcache_cache_block_size = 1MB
memqcache_oiddir = '/var/log/pgpool/oiddir'
cache_safe_memqcache_table_list = ''
cache_unsafe_memqcache_table_list = ''
backend_hostname0 = 'postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.ns1.svc.cluster.local'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/opt/build/pgpool/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.ns1.svc.cluster.local'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/opt/build/pgpool/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
What is the expected behavior?
Pgpool should consistently direct queries to the master node and promptly switch to the new master node whenever a change occurs.
What do you see instead?
Pgpool doesn't switch to the new master node until we restart the pgpool deployment.
-----------------------------
PostgreSQL configuration file
-----------------------------
listen_addresses = '*'
port = '5432'
max_connections = '512'
ssl = 'on'
ssl_cert_file = '/certs/tls.crt'
ssl_key_file = '/certs/tls.key'
wal_level = 'hot_standby'
fsync = 'on'
max_wal_size = '400MB'
archive_mode = 'on'
archive_command = '/bin/true'
max_wal_senders = '16'
wal_keep_size = '128MB'
max_replication_slots = '10'
primary_conninfo = 'host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.ns1.svc.cluster.local port=5432 user=repmgr'
promote_trigger_file = '/tmp/postgresql.trigger.5432'
hot_standby = 'on'
logging_collector = 'on'
log_directory = '/opt/build/postgresql/logs'
log_filename = 'postgresql.log'
log_connections = 'false'
log_disconnections = 'false'
log_hostname = 'true'
client_min_messages = 'error'
shared_preload_libraries = 'repmgr, pgaudit, repmgr'
include_dir = 'conf.d'
pgaudit.log_catalog = 'off'
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
ping
Hi @rohitkgupta,
I'm sorry for the late response.
I may need more information to help you with this issue. Does your issue persist over time when the master/slave switch happens? Or is this issue transient?
Please note that short downtimes may happen when the master is switched, because in case of failure the chart behaves as follows:
- Repmgr will notice the Master node is not available and select a new one.
- Pgpool-II health_check will detect the master node is not available and stop accepting new requests.
- Because new requests are being refused, the readiness probe will fail first and Pgpool-II will no longer receive traffic.
- Eventually, Pgpool-II will be restarted due to the liveness probe failure and will connect to the new master node.
If the error ERROR: cannot execute INSERT in a read-only transaction is indeed transient, I would recommend fine-tuning both the Pgpool-II health check settings and the chart liveness/readiness probe settings for quicker recovery from failure.
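As a rough sketch of that tuning (not an official recommendation): the health_check_* lines are standard pgpool-II parameters that already appear in the configuration dump above, while the pgpool.configuration override and the livenessProbe/readinessProbe keys are assumed to be exposed by the chart as shown later in this thread; the numbers are only placeholders to adapt to your environment.

pgpool:
  configuration: |
    # pgpool-II backend health check (same parameters as in the config dump above)
    health_check_period = 10
    health_check_timeout = 5
    health_check_max_retries = 3
    health_check_retry_delay = 2
  # assumed standard chart probe values; tighter timers restart a stale pgpool sooner
  livenessProbe:
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    periodSeconds: 5
    failureThreshold: 2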
@migruiz4 Thanks for your response. It is not a transient issue. Pgpool didn't restart and there were no probe failures for many hours; we had to restart Pgpool manually in this case.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Same issue for me with pgpool:4.5.1-debian-12-r0; downgrading to 4.5.0-debian-12-r10 fixed it.
Thank you very much for your feedback @BK-STAR77!
The container logic has not received any change from 4.5.0 to 4.5.1 that could cause this behavior. Checking pgpool release notes, it appears that it could be related to the change of the default value for failover_on_backend_shutdown.
Could you please confirm if setting failover_on_backend_shutdown=on fixes the issue? If that is the case, we should set it as the default value because that may be desired for Kubernetes.
I have the same issue here, and manually reverting to 4.5.0-debian-12-r10 in the deployment is working now... but this could be very dangerous for production...
Thank you very much for your feedback @BK-STAR77!
The container logic has not received any change from 4.5.0 to 4.5.1 that could cause this behavior. Checking pgpool release notes, it appears that it could be related to the change of the default value for failover_on_backend_shutdown.
Could you please confirm if setting failover_on_backend_shutdown=on fixes the issue? If that is the case, we should set it as the default value because that may be desired for Kubernetes.
Is it a variable in the values.yaml?
I don't see this variable in values.yaml :/
We are experiencing very bad failover issues also.
We updated to the newer charts from waaaaaay back and the latest charts appear to not failover correctly at all. We are using kubernetes.
We deploy then deploy our other pods to connect and all is fine.
Next we delete the primary node pod using kubectl delete pod
with a corresponding error in pgpool of the following
There have been other errors too that state it is in read-only mode
Caused by: org.postgresql.util.PSQLException: ERROR: cannot execute UPDATE in a read-only transaction
We have attempted to downgrade the pgpool image to 4.5.0 while still using the current charts, but pgpool will fail to start with the following error:
cp: cannot stat '/opt/bitnami/pgpool/etc/pgpool.conf.sample': No such file or directory
So currently, it appears the PostgreSQL 16 chart versions 14+ do not properly fail over, which makes HA useless and dangerous.
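For anyone else trying the downgrade route: pinning the tag through the chart values would presumably look like the sketch below (assuming the standard Bitnami pgpool.image.* keys; 4.5.0-debian-12-r10 is the tag others in this thread reverted to), although, as noted above, the 4.5.0 image does not start with the current chart.

pgpool:
  image:
    # assumed standard Bitnami image keys; tag taken from earlier comments in this thread
    repository: bitnami/pgpool
    tag: 4.5.0-debian-12-r10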
We were hoping to update since previous issues with pgpool and the chart caused pgpool to restart because of liveness probes; those updates to the probe script were added and we hoped all was fixed. There was also a previous issue where the DB would go into split-brain pretty often, which was super dangerous because it wouldn't fail: you would have one outdated DB and one current, and it would read from both... so we had to shut off load balancing.
Edit:
After restarting PGPool they all EVENTUALLY came back up. Will test again to see if it works eventually WITHOUT restart and see if it replicates again.
I am also poring through the other issues to see if any are related.
@Stevenpc3 Exactly the same thing is happening to me, but as I only started using postgresql-ha about a month ago I thought it was a problem with some bad configuration on my part. I had this problem for weeks, and it only started to work after I deleted the extra primary, because that recreated everything; it does not recover automatically. In my case I ended up having different random data and I noticed that I had two primaries.
@apoca the issue you are describing is split-brain: https://github.com/bitnami/charts/issues/20998 (there is more on the internet about it too). The issue I mentioned at the bottom of my comment is more that replication stops working with no errors or issues detected, which is super dangerous and makes failover pointless.
Both of those above issues are not related to the current issue in this thread though. This thread appears to be an issue with PGPool and/or the bitnami chart configs that use it.
You can add the flag
postgresql-ha:
  pgpool:
    configuration: |
      connection_cache = false
      failover_on_backend_error = on
      failover_on_backend_shutdown = on
You can add the flag
postgresql-ha:
  pgpool:
    configuration: |
      connection_cache = false
      failover_on_backend_error = on
      failover_on_backend_shutdown = on
Is the "postgresql-ha:" name correct? My config file has the name as "postgresql:" - also, the pgpool: section is a top-level item in the values.yaml... Sorry if I'm missing something here - I've just started using this.
Is it not meant to be:
postgresql:
  ...
pgpool:
  configuration: |-
    connection_cache = false
    failover_on_backend_error = on
    failover_on_backend_shutdown = on
?
My apologies, your way is correct for using the chart directly. I am using a wrapper chart. We have a chart that uses this chart as a dependency so we can template a few things.
Since you talk about this, I have a question here... I have software as a service and I create a schema/database and a user/password, as well as the necessary permissions, in real time... Sometimes I have a problem with Kind where it says that the database does not exist (but it exists); my question is whether this "false" cache setting solves this type of problem in pgpool?
Same issue here with a vanilla install: postgresql-ha 14.0.2 (app version 16.2.0).
Hi there!
We have released a new version of the bitnami/pgpool container setting the default value for failover_on_backend_shutdown=on (https://github.com/bitnami/containers/pull/65526).
This patched image is included in the postgresql-ha Helm chart starting with version 14.0.6. Could you please confirm if the issue persists after upgrading the chart to use the patched pgpool image?
@migruiz4 I am using the latest version, 14.0.10, and I had this error today; after rolling out pgpool again it is working. This is a very serious issue and I won't put this into production.
@apoca Could you please provide a detailed description of your issue? We need additional information in order to troubleshoot and fix this issue
We also tried setting these values manually and could not get proper failover to work. I will be doing more testing of the charts tomorrow and will try to write up a more detailed example of the setup, test, and results.
What failover testing has Bitnami done to validate this?
It would be nice for us to know and replicate the procedure used to validate the chart and operation to compare.
The basic test we tried and others seem to try is:
- Deploy the chart with 3 instances.
- Have a service connect through pgpool that does read and write operations.
- Delete the primary node pod. Expect a fail over and then the service should still be able to use the database with the new primary.
Instead the service can't because pgpool didn't fail over, or if it did, it thinks the new primary is read only.
Hi @rohitkgupta,
I'm sorry for the late response.
I may need more information to help you with this issue. Does your issue persist over time when the master/slave switch happens? Or is this issue transient?
Please note that short downtimes may happen when the master is switched, because in case of failure the chart behaves as follows:
- Repmgr will notice the Master node is not available and select a new one.
- Pgpool-II health_check will detect the master node is not available and stop accepting new requests.
- Because new requests are being refused, the readiness probe will fail first and Pgpool-II will no longer receive traffic.
- Eventually, Pgpool-II will be restarted due to the liveness probe failure and will connect to the new master node.
If the error ERROR: cannot execute INSERT in a read-only transaction is indeed transient, I would recommend fine-tuning both the Pgpool-II health check settings and the chart liveness/readiness probe settings for quicker recovery from failure.
From this description, would you state that you EXPECT pgpool to fail its liveness probe and subsequently restart as part of a normal and proper failover?
We have a case where we do not want pgpool to restart. I would hope it is possible to fail over without restarting pgpool.
Hi @Stevenpc3, the tests we execute for postgresql-ha can be found here: https://github.com/bitnami/charts/tree/main/.vib/postgresql-ha.
From this description, would you state that you EXPECT pgpool to fail its liveness probe and subsequently restart as part of a normal and proper failover?
Yes, before the pgpool-II upgrade from version 4.5.0 to 4.5.1, when using the bitnami/postgresql-ha chart with default values, short downtimes would be expected if the PostgreSQL primary node goes down.
Because requirements may vary from one user to another, we usually encourage users to fine-tune pgpool-II settings, repmgr settings and liveness/readiness timers to reduce those downtimes for their use case.
The bug we are addressing in this issue is the behavior OP reported, where on primary failure pgpool-II would never establish a connection to the new primary node and all client connections would fail until a manual restart is performed, which is a much more severe issue.
Yes, I agree it's a severe issue; we have that issue too, which is why I asked
"How did you test before publishing and not see this?" Was failover part of the normal test?
As for the pgpool liveness probe, you added a full restart of pgpool as normal operation. This seems off to me. Would you expect pgpool to be restarted in a bare-metal deployment? Why is pgpool expected to restart in Kubernetes?
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
bump
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
I have created an internal task to fix this issue and added the on-hold label to prevent the stale bot while this is still relevant.
I'm sorry, but we are a small team and have to deal with other priorities at the moment. If you would like to help us find a fix for this issue and contribute to the project, please send a pull request and we will be happy to review it.