
[bitnami/postgresql-repmgr] - PostgreSQL HA - solution for split brain - multi master

Open martinfar opened this issue 3 years ago • 7 comments

Description of the change

This change handles the case of a failing primary node that has a persistent volume attached to it. That creates a problem because the primary cannot move to another node, so the replicas can be left without an available upstream. To address this, I added the possibility for a standby node to promote itself when the primary has been unavailable for more than REPMGR_WAIT_PRIMARY_TIME.

I also added random jitter to some of the waiting times so that the standby replicas re-check the primary at different points in time.
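A minimal sketch of the jitter idea (illustrative only, not the actual patch; the function name and the 0-9s range are assumptions):

```shell
# Sketch: each standby adds a small random offset to its base wait so that
# replicas re-check the primary at different moments instead of in lockstep.
compute_wait() {
  local base="${REPMGR_WAIT_PRIMARY_TIME:-30}"   # base wait in seconds (assumed default)
  echo $(( base + RANDOM % 10 ))                 # add 0-9 seconds of random jitter
}
compute_wait
```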

Benefits

The benefits:
  • a more resilient cluster in case of a definite failure of the primary node
  • more control over how long we wait for the primary when its node is busy restarting, which prevents the well-known split brain. Reference: https://github.com/bitnami/charts/pull/9035

Possible drawbacks

  • possibly too much waiting time before a new primary is elected.

Applicable issues

https://github.com/bitnami/charts/pull/9035

Additional information

It's important to pair this improvement with a liveness probe whose period is at least REPMGR_WAIT_PRIMARY_TIME plus the time needed for node setup.

martinfar avatar Aug 01 '22 18:08 martinfar

Hi team, have you had a chance to review this merge request?

martinfar avatar Aug 10 '22 17:08 martinfar

Hi, I wanted to know if the team has been able to review this pull request?

martinfar avatar Aug 26 '22 17:08 martinfar

This Pull Request has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thank you for your contribution.

github-actions[bot] avatar Sep 11 '22 01:09 github-actions[bot]

Hi Bitnami team. Were you able to review the proposed changes?

martinfar avatar Sep 11 '22 01:09 martinfar

Why don't you just add the possibility to configure/create nodes of the witness type? I thought witness nodes exist precisely to prevent split brain (https://repmgr.org/docs/repmgr.html#REPMGR-CONCEPTS / https://repmgr.org/docs/repmgr.html#REPMGR-WITNESS-REGISTER)? Currently only primary and standby nodes can be configured.

seriouz avatar Sep 14 '22 14:09 seriouz

@seriouz adding a witness node would be major surgery for this change. However, with the changes in this commit I was able to survive a datacenter outage without downtime. I am running this Postgres cluster in production.

martinfar avatar Sep 15 '22 00:09 martinfar

@migruiz4 I've already made the changes. Can you let me know if you need anything else?

martinfar avatar Sep 15 '22 00:09 martinfar

This Pull Request has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thank you for your contribution.

github-actions[bot] avatar Oct 05 '22 01:10 github-actions[bot]

Hi Bitnami team, is it possible to move forward with this merge?

martinfar avatar Oct 05 '22 05:10 martinfar

This Pull Request has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thank you for your contribution.

github-actions[bot] avatar Oct 22 '22 01:10 github-actions[bot]

Hi Bitnami team, any news about this request?

martinfar avatar Oct 22 '22 04:10 martinfar

Hi @martinfar, I'm sorry for the late response.

After some in-depth evaluation of this proposal, I'm afraid we cannot merge these changes as they are, for the following reasons:

  1. We cannot change the reconnect_interval to a random value, as it is not deterministic (different deployments could behave differently). https://github.com/bitnami/containers/blob/beb5b0b050c0cb93ac756708339d99aea8bd1f21/bitnami/postgresql-repmgr/10/debian-11/rootfs/opt/bitnami/scripts/librepmgr.sh#L537-L539
  2. The same applies to the changes proposed on the repmgr_wait_primary_node function. https://github.com/bitnami/containers/blob/beb5b0b050c0cb93ac756708339d99aea8bd1f21/bitnami/postgresql-repmgr/10/debian-11/rootfs/opt/bitnami/scripts/librepmgr.sh#L611-L615
  3. The new REPMGR_PRIMARY_ROLE_LOCK_FILE_NAME could introduce additional corner cases, and its functionality is unclear. I would like to share my insights on it. Consider the container's setup.sh:
    # Ensure PostgreSQL & repmgr environment variables settings are valid
    repmgr_validate
    postgresql_validate
    
    # Set the environment variables for the node's role
    eval "$(repmgr_set_role)"
    
    # Ensure PostgreSQL is stopped when this script ends.
    trap "postgresql_stop" EXIT
    # Ensure 'daemon' user exists when running as 'root'
    am_i_root && ensure_user_exists "$POSTGRESQL_DAEMON_USER" --group "$POSTGRESQL_DAEMON_GROUP"
    # Prepare PostgreSQL default configuration
    repmgr_postgresql_configuration
    # Prepare repmgr configuration
    repmgr_generate_repmgr_config
    # Initialize PostgreSQL & repmgr
    repmgr_initialize
    
    What the proposed changes consist of:
    • The function repmgr_wait_primary_node has been modified to retry the upstream connection to obtain a primary node.
    • At eval "$(repmgr_set_role)", the container tries to connect to the upstream primary; if a connection can't be established, the node starts as primary if it is marked to do so, otherwise as standby. The proposed change is to run the modified repmgr_wait_primary_node and, if it fails, create the 'promote standby' lock file. This runs whether the node is assigned the primary or the standby role, as long as it wasn't a primary before. If no upstream nodes can be reached, the lock file will exist.
    • During repmgr_initialize, if the node has the 'primary' role and it is not a first boot, the standby promote command will be executed. I'm afraid this logic is too complicated and would make troubleshooting harder. Additionally, the objective of the promotion lock file is unclear: in what scenario would it be helpful?

With the current proposed changes, it looks like the objective is to run the promote command when the following conditions are met:

  • Node wasn't a primary node in the previous execution.
  • Upstream couldn't be reached.
  • Node is marked as initial primary during current execution.
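The conditions above could be sketched as follows (function and variable names are illustrative, not the actual identifiers from the PR):

```shell
# Sketch of the promotion decision described above: promote only when the
# node was not previously primary, the upstream is unreachable, and the
# node is marked as the initial primary in the current execution.
should_self_promote() {
  local was_primary="$1" upstream_reachable="$2" marked_initial_primary="$3"
  if [ "$was_primary" = "no" ] \
     && [ "$upstream_reachable" = "no" ] \
     && [ "$marked_initial_primary" = "yes" ]; then
    return 0   # this is where 'repmgr standby promote' would run
  fi
  return 1
}
should_self_promote no no yes && echo "promote"
```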

But that raises two additional questions:

  • Is a lock file really necessary? Do we need to persist it between executions?
  • Considering we would be promoting Standby nodes if they are isolated from upstream, wouldn't that cause even more split-brain scenarios?

It would be helpful if you could provide more information about the scenarios in which this feature would be beneficial; otherwise, we may have to close this PR.

migruiz4 avatar Oct 24 '22 16:10 migruiz4

Hi @migruiz4. Thanks for the feedback. I will analyse your suggestions.

martinfar avatar Oct 24 '22 16:10 martinfar

@martinfar If there is anything we could do to help you please let us know.

migruiz4 avatar Oct 27 '22 09:10 migruiz4

@martinfar If there is anything we could do to help you please let us know.

@migruiz4 In fact, if you can help me: I would like to know if you have a better idea for replacing the random time periods with a scheme where each node checks at a different time after the primary becomes unavailable. The idea is to avoid two independent nodes deciding to promote themselves to primary at the same time, which would cause a split-brain situation.
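For illustration, one deterministic alternative to random periods could derive each node's offset from its ordinal, so every deployment behaves identically (the names and the step size below are hypothetical, not from the chart):

```shell
# Sketch: stagger each node's re-check deterministically by its ordinal
# (e.g. the StatefulSet pod index), instead of using $RANDOM.
stagger_for_node() {
  local node_id="$1"                          # e.g. 0, 1, 2 from the pod ordinal
  local step="${RECONNECT_STAGGER_STEP:-5}"   # seconds between nodes (assumed)
  echo $(( node_id * step ))
}
stagger_for_node 2   # node 2 waits 10s longer than node 0
```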

martinfar avatar Oct 29 '22 17:10 martinfar

The idea is to avoid two independent nodes deciding to promote themselves to primary at the same time, which would cause a split-brain situation.

The best way to prevent split-brain is through quorum, not sure if this is the case, but I would always recommend deploying an odd number of nodes.

Normally, deploying an odd number of nodes (n) should be enough to prevent split-brain scenarios, as (n / 2 + 1) nodes need to agree for a new primary to be promoted.
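As a quick worked example of the majority rule above:

```shell
# Majority needed for promotion in an n-node cluster: floor(n / 2) + 1.
# With an odd n, an even split is impossible, so split brain cannot occur;
# an even n needs the same majority as n + 1, so the extra node buys nothing.
quorum() { echo $(( $1 / 2 + 1 )); }
quorum 3   # prints 2
quorum 4   # prints 3 (same majority as a 5-node cluster)
```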

The witness node function is to force quorum in a 2-node deployment, where the witness node votes on the primary node election, but does not participate in the replication.
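Per the repmgr documentation linked earlier, registering a witness looks roughly like this (host name and config path are illustrative):

```shell
# Register a witness node against the current primary; the witness then
# votes in primary elections but holds no replica data.
repmgr -f /etc/repmgr.conf witness register -h pg-primary.example.com
```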

migruiz4 avatar Nov 09 '22 10:11 migruiz4

The witness node function is to force quorum in a 2-node deployment, where the witness node votes on the primary node election, but does not participate in the replication.

I really think that a witness node is the solution to all of this... where is this feature in the roadmap ? I mean in this bitnami project Postgres HA

martinfar avatar Nov 10 '22 23:11 martinfar

The witness node function is to force quorum in a 2-node deployment, where the witness node votes on the primary node election, but does not participate in the replication.

I really think that a witness node is the solution to all of this... where is this feature in the roadmap ? I mean in this bitnami project Postgres HA

Are you thinking of something like this? https://github.com/bitnami/containers/issues/6663

seriouz avatar Nov 11 '22 10:11 seriouz

The witness node function is to force quorum in a 2-node deployment, where the witness node votes on the primary node election, but does not participate in the replication.

I really think that a witness node is the solution to all of this... where is this feature in the roadmap ? I mean in this bitnami project Postgres HA

Are you thinking of something like this? #6663

Yes, the idea is to implement a witness node as you suggested in that issue.

martinfar avatar Nov 26 '22 00:11 martinfar

This Pull Request has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thank you for your contribution.

github-actions[bot] avatar Dec 11 '22 01:12 github-actions[bot]

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Pull Request. Do not hesitate to reopen it later if necessary.

github-actions[bot] avatar Dec 17 '22 01:12 github-actions[bot]