postgresql_cluster icon indicating copy to clipboard operation
postgresql_cluster copied to clipboard

PITR: Cluster stuck in running state

Open BabaTataKaka opened this issue 1 year ago • 8 comments

I'm testing the PITR and running the following .

ansible-playbook eploy_cluster.yml --tags point_in_time_recovery

I've set the following value in the main.yml:

patroni_cluster_bootstrap_method: "pgbackrest" 
patroni_create_replica_methods:
  #Uncomment pgbackrest only when restoring DB
  - pgbackrest
  - basebackup
postgresql_restore_command: "pgbackrest --stanza={{ pgbackrest_stanza }} archive-get %f %p" # restore WAL-s using pgbackrest
 '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --type=time "--target=2024-07-31 13:30:00" --delta restore'  # Point-in-Time Recovery (example)

The restore is successfully on the master only(replication fail?) and the state in patroni is stuck at running:

"+ Cluster: postgres-cluster (7397789376619538656) ---+-----------+", "| Member | Host | Role | State | TL | Lag in MB |", "+----------+----------------+---------+---------+----+-----------+", "| svr001 | svr001.dh.pa | Leader | running | 3 | |", "| svr002 | svr002.dh.pa | Replica | running | 1 | 0 |", "| svr003 | svr003.dh.pa.ll | Replica | running | 1 | 0 |", "+----------+----------------+---------+---------+----+-----------+"

Postgres master logs:

2024-07-31 13:55:13 UTC [34061-4] 10.136.16.118(38560) replicator@[unknown] ERROR:  requested starting point 0/8000000 on timeline 1 is not in this server's history
2024-07-31 13:55:13 UTC [34061-5] 10.136.16.118(38560) replicator@[unknown] DETAIL:  This server's history forked from timeline 1 at 0/7000170.
2024-07-31 13:55:13 UTC [34061-6] 10.136.16.118(38560) replicator@[unknown] STATEMENT:  START_REPLICATION SLOT "svr002" 0/8000000 TIMELINE 1


Postgres replicas logs:

2024-07-31 14:10:23 UTC [31599-1]  FATAL:  could not start WAL streaming: ERROR:  requested starting point 0/8000000 on timeline 1 is not in this server's history
	DETAIL:  This server's history forked from timeline 1 at 0/7000170.

This is from a PITR from an existing cluster. Looks like its working when restoring from a new cluster(no point_in_time_recovery tag). Looks like a problem wit the timeline but not sure why. There is no error in patroni or pgbackrest logs on both the master and replicas.

BabaTataKaka avatar Jul 31 '24 14:07 BabaTataKaka

Hi @BabaTataKaka

It looks like you have a problem in the backup repository, perhaps several different clusters performed archiving to the same repository (one stanza).

This question is not related to the automation issue, but rather to the maintenance issue.

vitabaks avatar Jul 31 '24 15:07 vitabaks

@BabaTataKaka Hi, please tell me, what kind of archive command do you have? And do you have backup full/diff?

SDV109 avatar Jul 31 '24 15:07 SDV109

@BabaTataKaka Hi, please tell me, what kind of archive command do you have? And do you have backup full/diff?

pgbackrest --stanza={{ pgbackrest_stanza }} archive-push %p

Full backup

BabaTataKaka avatar Jul 31 '24 16:07 BabaTataKaka

Hi @BabaTataKaka

It looks like you have a problem in the backup repository, perhaps several different clusters performed archiving to the same repository (one stanza).

This question is not related to the automation issue, but rather to the maintenance issue.

This is from a new cluster. Here what i did : 1-Deployed the cluster. 2-Entered data in DB 3-Ran backup from pgbackrest server 4- Deleted data from DB 5- Ran point_in_time_recovery.

Based on my research and like you mentioned, this issue actually happen when you try to restore a db to a different cluster. However, in this case, its from the same cluster. So , i was wondering may be I'm not using the playbook correctly?

BabaTataKaka avatar Jul 31 '24 16:07 BabaTataKaka

@BabaTataKaka, I see you have --target=2024-07-31 13:30:00, what time did you launch point_in_time_recovery? There is an assumption that the necessary WALs for time recovery did not have time to be on the pgbackrest server. Try to recover first with the help of:

pgbackrest_patroni_cluster_restore_command: '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore' # restore from latest backup

After that, try to do point_in_time_recovery in 1 hour. For example: You need to recover at 13:30, and start it at 14:30

SDV109 avatar Jul 31 '24 16:07 SDV109

Please take a moment to read about the error "requested starting point XXXXXX on timeline X is not in this server's history" on the Internet for better understanding.

Additionally, consider the following suggestions:

  • If this is a test environment, you might want to clear the backup directory.
  • Alternatively, you can create a new stanza, perform a fresh backup, and then attempt the recovery again.
  • Or, try with basebackup only in patroni_create_replica_methods. To restore Primary from a backup and re-initialize replicas using pg_basebackup.

vitabaks avatar Jul 31 '24 17:07 vitabaks

@BabaTataKaka, I see you have --target=2024-07-31 13:30:00, what time did you launch point_in_time_recovery? There is an assumption that the necessary WALs for time recovery did not have time to be on the pgbackrest server. Try to recover first with the help of:

pgbackrest_patroni_cluster_restore_command: '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore' # restore from latest backup

After that, try to do point_in_time_recovery in 1 hour. For example: You need to recover at 13:30, and start it at 14:30

I lauched the recoverdy 2-3 mins after. I will try your suggestion. Thanks.

BabaTataKaka avatar Jul 31 '24 17:07 BabaTataKaka

Please take a moment to read about the error "requested starting point XXXXXX on timeline X is not in this server's history" on the Internet for better understanding.

Additionally, consider the following suggestions:

* If this is a test environment, you might want to clear the backup directory.

* Alternatively, you can create a new stanza, perform a fresh backup, and then attempt the recovery again.

* Or, try with `basebackup` only in `patroni_create_replica_methods`. To restore Primary from a backup and re-initialize replicas using pg_basebackup.

Please take a moment to read about the error "requested starting point XXXXXX on timeline X is not in this server's history" on the Internet for better understanding.

Additionally, consider the following suggestions:

* If this is a test environment, you might want to clear the backup directory.

* Alternatively, you can create a new stanza, perform a fresh backup, and then attempt the recovery again.

* Or, try with `basebackup` only in `patroni_create_replica_methods`. To restore Primary from a backup and re-initialize replicas using pg_basebackup.

I did , but I guess I'll have to dig a lot deeper. Thanks for your suggestion. I will try SVD109 solution first for now. Yes, its only in a DEV env so there is no damage.

BabaTataKaka avatar Jul 31 '24 17:07 BabaTataKaka

Is the problem still relevant?

vitabaks avatar Aug 31 '24 10:08 vitabaks

Documentation: https://postgresql-cluster.org/management/restore

vitabaks avatar Aug 31 '24 10:08 vitabaks

Sorry I was away for 3 weeks(vacations).

Still having issue but I think it might be a patroni/posgres problem because the master is being restored succesfully.

Question, when deploying a new cluster (initdb), should the patroni_create_replica_methods be set to pgbackrest and basebackup if using pgbackrest as a backup method? My understanding ,based on the doc, is that it should be only set to pgbackrest when restoring right?

BabaTataKaka avatar Sep 03 '24 18:09 BabaTataKaka

I have written documentation on the topic of restore. Please see the configuration examples. For example, you need to specify the same restore command for the replica in the pgbackrest options, otherwise the replica will be restored to a different state than primary.

And yes, you can remove basebackup from the list of options if necessary.

vitabaks avatar Sep 04 '24 10:09 vitabaks