postgresql_cluster
postgresql_cluster copied to clipboard
PITR: Cluster stuck in running state
I'm testing the PITR and running the following .
ansible-playbook eploy_cluster.yml --tags point_in_time_recovery
I've set the following value in the main.yml:
patroni_cluster_bootstrap_method: "pgbackrest"
patroni_create_replica_methods:
#Uncomment pgbackrest only when restoring DB
- pgbackrest
- basebackup
postgresql_restore_command: "pgbackrest --stanza={{ pgbackrest_stanza }} archive-get %f %p" # restore WAL-s using pgbackrest
'/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --type=time "--target=2024-07-31 13:30:00" --delta restore' # Point-in-Time Recovery (example)
The restore is successfully on the master only(replication fail?) and the state in patroni is stuck at running:
"+ Cluster: postgres-cluster (7397789376619538656) ---+-----------+", "| Member | Host | Role | State | TL | Lag in MB |", "+----------+----------------+---------+---------+----+-----------+", "| svr001 | svr001.dh.pa | Leader | running | 3 | |", "| svr002 | svr002.dh.pa | Replica | running | 1 | 0 |", "| svr003 | svr003.dh.pa.ll | Replica | running | 1 | 0 |", "+----------+----------------+---------+---------+----+-----------+"
Postgres master logs:
2024-07-31 13:55:13 UTC [34061-4] 10.136.16.118(38560) replicator@[unknown] ERROR: requested starting point 0/8000000 on timeline 1 is not in this server's history
2024-07-31 13:55:13 UTC [34061-5] 10.136.16.118(38560) replicator@[unknown] DETAIL: This server's history forked from timeline 1 at 0/7000170.
2024-07-31 13:55:13 UTC [34061-6] 10.136.16.118(38560) replicator@[unknown] STATEMENT: START_REPLICATION SLOT "svr002" 0/8000000 TIMELINE 1
Postgres replicas logs:
2024-07-31 14:10:23 UTC [31599-1] FATAL: could not start WAL streaming: ERROR: requested starting point 0/8000000 on timeline 1 is not in this server's history
DETAIL: This server's history forked from timeline 1 at 0/7000170.
This is from a PITR from an existing cluster. Looks like its working when restoring from a new cluster(no point_in_time_recovery tag). Looks like a problem wit the timeline but not sure why. There is no error in patroni or pgbackrest logs on both the master and replicas.
Hi @BabaTataKaka
It looks like you have a problem in the backup repository, perhaps several different clusters performed archiving to the same repository (one stanza).
This question is not related to the automation issue, but rather to the maintenance issue.
@BabaTataKaka Hi, please tell me, what kind of archive command do you have? And do you have backup full/diff?
@BabaTataKaka Hi, please tell me, what kind of archive command do you have? And do you have backup full/diff?
pgbackrest --stanza={{ pgbackrest_stanza }} archive-push %p
Full backup
Hi @BabaTataKaka
It looks like you have a problem in the backup repository, perhaps several different clusters performed archiving to the same repository (one stanza).
This question is not related to the automation issue, but rather to the maintenance issue.
This is from a new cluster. Here what i did : 1-Deployed the cluster. 2-Entered data in DB 3-Ran backup from pgbackrest server 4- Deleted data from DB 5- Ran point_in_time_recovery.
Based on my research and like you mentioned, this issue actually happen when you try to restore a db to a different cluster. However, in this case, its from the same cluster. So , i was wondering may be I'm not using the playbook correctly?
@BabaTataKaka, I see you have --target=2024-07-31 13:30:00, what time did you launch point_in_time_recovery? There is an assumption that the necessary WALs for time recovery did not have time to be on the pgbackrest server. Try to recover first with the help of:
pgbackrest_patroni_cluster_restore_command: '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore' # restore from latest backup
After that, try to do point_in_time_recovery in 1 hour. For example: You need to recover at 13:30, and start it at 14:30
Please take a moment to read about the error "requested starting point XXXXXX on timeline X is not in this server's history" on the Internet for better understanding.
Additionally, consider the following suggestions:
- If this is a test environment, you might want to clear the backup directory.
- Alternatively, you can create a new stanza, perform a fresh backup, and then attempt the recovery again.
- Or, try with
basebackuponly inpatroni_create_replica_methods. To restore Primary from a backup and re-initialize replicas using pg_basebackup.
@BabaTataKaka, I see you have --target=2024-07-31 13:30:00, what time did you launch point_in_time_recovery? There is an assumption that the necessary WALs for time recovery did not have time to be on the pgbackrest server. Try to recover first with the help of:
pgbackrest_patroni_cluster_restore_command: '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore' # restore from latest backup
After that, try to do point_in_time_recovery in 1 hour. For example: You need to recover at 13:30, and start it at 14:30
I lauched the recoverdy 2-3 mins after. I will try your suggestion. Thanks.
Please take a moment to read about the error "
requested starting point XXXXXX on timeline X is not in this server's history" on the Internet for better understanding.Additionally, consider the following suggestions:
* If this is a test environment, you might want to clear the backup directory. * Alternatively, you can create a new stanza, perform a fresh backup, and then attempt the recovery again. * Or, try with `basebackup` only in `patroni_create_replica_methods`. To restore Primary from a backup and re-initialize replicas using pg_basebackup.
Please take a moment to read about the error "
requested starting point XXXXXX on timeline X is not in this server's history" on the Internet for better understanding.Additionally, consider the following suggestions:
* If this is a test environment, you might want to clear the backup directory. * Alternatively, you can create a new stanza, perform a fresh backup, and then attempt the recovery again. * Or, try with `basebackup` only in `patroni_create_replica_methods`. To restore Primary from a backup and re-initialize replicas using pg_basebackup.
I did , but I guess I'll have to dig a lot deeper. Thanks for your suggestion. I will try SVD109 solution first for now. Yes, its only in a DEV env so there is no damage.
Is the problem still relevant?
Documentation: https://postgresql-cluster.org/management/restore
Sorry I was away for 3 weeks(vacations).
Still having issue but I think it might be a patroni/posgres problem because the master is being restored succesfully.
Question, when deploying a new cluster (initdb), should the patroni_create_replica_methods be set to pgbackrest and basebackup if using pgbackrest as a backup method? My understanding ,based on the doc, is that it should be only set to pgbackrest when restoring right?
I have written documentation on the topic of restore. Please see the configuration examples. For example, you need to specify the same restore command for the replica in the pgbackrest options, otherwise the replica will be restored to a different state than primary.
And yes, you can remove basebackup from the list of options if necessary.