
ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1

Open imranrazakhan opened this issue 2 years ago • 13 comments

We have the following environment:

  • Three-node cluster deployed using chart timescaledb-single 0.10.0 in a test environment
  • Kubernetes 1.22.3
  • Storage is configured using local persistent volumes.
  • Backup is not configured; I set backup=false in the chart values (see the snippet below).

Today I wanted to clean the disks, so I scaled down the timescaledb pods, wiped the disks on all three nodes, and then scaled the pods back up, but I am getting the following error. Am I missing something? Is there any way to start from a blank disk again?
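
For reference, backup is disabled in my chart values roughly like this (assuming the chart's backup.enabled key, per timescaledb-single's values.yaml):

backup:
  enabled: false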

# kubectl -n dev logs -f timescaledb-0 -c timescaledb
2021-12-04 21:55:53,775 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2021-12-04 21:55:53,775 ERROR: failed to bootstrap (without leader)
2021-12-04 21:56:04,206 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2021-12-04 21:56:04,207 ERROR: failed to bootstrap (without leader)
2021-12-04 21:56:14,207 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2021-12-04 21:56:14,207 ERROR: failed to bootstrap (without leader)
2021-12-04 21:56:24,207 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2021-12-04 21:56:24,207 ERROR: failed to bootstrap (without leader)
2021-12-04 21:56:34,207 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1

I logged into the timescaledb pod and checked the Patroni status. This is the first instance, so why is its role Replica rather than master?

$ patronictl list
+ Cluster: yq (uninitialized) +---------+---------+----+-----------+
| Member        | Host        | Role    | State   | TL | Lag in MB |
+---------------+-------------+---------+---------+----+-----------+
| timescaledb-0 | 10.244.0.78 | Replica | stopped |    |   unknown |
+---------------+-------------+---------+---------+----+-----------+

imranrazakhan avatar Dec 05 '21 00:12 imranrazakhan

@imranrazakhan Deleting the k8s services from this chart (the load balancer and node-IP-related ones, depending on your config) left over from the previous helm deployment should resolve this issue, as it did for me.
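
Roughly something like this (the namespace, label, and service names below are illustrative for a release called "timescaledb"; check what your cluster actually has first):

# List services left over from the previous deployment (label selector is an assumption)
$ kubectl -n dev get svc -l app=timescaledb
# Delete the leftovers; names here assume a release called "timescaledb"
$ kubectl -n dev delete svc timescaledb timescaledb-replica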

davidandreoletti avatar Dec 07 '21 12:12 davidandreoletti

@davidandreoletti Thanks for the update, I will check this. Can we have more insight into why we have to delete the services? Is it related to the endpoints? I checked the endpoint YAML but couldn't find any hint about what is stopping us from doing a clean start.
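
For reference, this is roughly what I inspected (release name "timescaledb" is an assumption):

$ kubectl -n dev get ep timescaledb-config -o yaml
# Patroni keeps its cluster state in metadata.annotations on this endpoint,
# so that section is the part worth checking, not the endpoint subsets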

imranrazakhan avatar Dec 07 '21 12:12 imranrazakhan

Having the same issue. Deleting the resources from the previous helm deployment did not solve the issue for me.

jholm117 avatar Jan 24 '22 22:01 jholm117

Same issue, and confirmed there are no resources left in the cluster from the previous install.

bleggett avatar Jan 27 '22 16:01 bleggett

> Having the same issue. Deleting the resources from the previous helm deployment did not solve the issue for me.

I was able to get this working eventually, it's possible I missed cleaning up an endpoint or something.

jholm117 avatar Jan 28 '22 14:01 jholm117

@jholm117 @davidandreoletti We can fix this issue by deleting just one ep (Endpoint) with a name like clustername-config, where clustername is the name provided during the helm installation.
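
For example (namespace and release name here are assumptions; substitute your own):

$ kubectl -n dev get ep
# The -config endpoint carries Patroni's "initialize" annotation with the old
# cluster's system identifier, so a wiped disk gets bootstrapped as a replica
# of the (now gone) old cluster instead of as a fresh primary
$ kubectl -n dev delete ep timescaledb-config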

imranrazakhan avatar Mar 24 '22 10:03 imranrazakhan

I am still seeing this issue after using a different release name and deleting the older endpoints. It just stops suddenly after some time. Any different solutions would be greatly appreciated. Thanks.

veereshhalagegowda avatar Aug 26 '22 14:08 veereshhalagegowda

Same here. It is happening in the latest release, 0.27.4. It resolves automatically after a few minutes.

jleni avatar Jan 01 '23 18:01 jleni

Same issue here. Happens on the latest 0.27.5 as well. It would be good to finally see this fixed.

jprecuch avatar Jan 23 '23 14:01 jprecuch

Same issue here; moving the deployment to a new namespace solved it temporarily for me.

jfaldanam avatar Jan 24 '23 07:01 jfaldanam

Removing endpoints from a previous helm deployment solved it for me.

JohnTzoumas avatar May 24 '23 16:05 JohnTzoumas

@JohnTzoumas thanks a lot! I have the same issue with the latest version.

I am testing disaster recovery right now and killed all PVCs + pods. The startup of the new timescale pod stops at:

timescaledb 2023-05-25 11:53:56,422 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1

When I delete the 4 endpoints, the recovery runs through.
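
That looked roughly like this (namespace and label are assumptions; list the endpoints first and delete the ones belonging to your release):

$ kubectl -n dev get ep
# Delete every endpoint belonging to the release; deleting the four
# endpoints one by one by name works just as well
$ kubectl -n dev delete ep -l app=timescaledb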

ayeks avatar May 25 '23 12:05 ayeks

I have the same issue, but in my case I have disabled persistent storage, because in our dev environment we would like to clean the DB by just restarting the container. I have also tried setting patroni.postgresql.pgbackrest.keep_data = false, but it had no effect.
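
For context, my values look roughly like this (the persistentVolumes.* and backup.enabled keys follow the chart's default values.yaml; treat the exact layout as an assumption):

backup:
  enabled: false
persistentVolumes:
  data:
    enabled: false
  wal:
    enabled: false
patroni:
  postgresql:
    pgbackrest:
      # the setting mentioned above; it had no effect for me
      keep_data: false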

MSandro avatar Dec 21 '23 08:12 MSandro