
Fix that pgbackrest sometimes stops operating in prod cluster (at least add alerts!)

Open Venryx opened this issue 1 year ago • 1 comments

Venryx, Jun 25 '24 23:06

Update

After restoring the database, the pgbackrest backups started working again (the first new backup was on June 26th). The restore went: open a terminal in the stuck db pod -> scp the contents to another server -> launch the same version of postgres against the pgdata directory from the scp transfer -> run pg_dump against that temp instance -> clear the PVC in the prod cluster and import from the dump.

On July 25th, though, the database pod's PVC hit 100% storage usage again, and the same issue came back. When I checked the pgbackrest backups at that point, the last successful one was from July 20th.
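A gap like that (last good backup five days old, nobody noticed) could be caught with a staleness check. Below is a rough sketch, not something that exists in the repo. It assumes the parsed output of `pgbackrest info --output=json` is a list of stanzas, each with a `backup` list whose entries carry `timestamp.stop` as epoch seconds. The function names and the `max_age_seconds` threshold are hypothetical.

```python
import time
from typing import List, Optional


def newest_backup_stop(info: List[dict]) -> Optional[int]:
    """Return the latest backup stop time (epoch seconds) across all stanzas.

    Assumes `info` is the parsed JSON from `pgbackrest info --output=json`.
    Returns None when no backups are listed at all.
    """
    stops = [
        backup["timestamp"]["stop"]
        for stanza in info
        for backup in stanza.get("backup", [])
    ]
    return max(stops) if stops else None


def backups_are_stale(
    info: List[dict], max_age_seconds: float, now: Optional[float] = None
) -> bool:
    """True if there is no backup, or the newest one is older than the threshold."""
    now = time.time() if now is None else now
    newest = newest_backup_stop(info)
    return newest is None or (now - newest) > max_age_seconds
```

The threshold would need to match whatever the actual backup schedule is; the issue does not say how often backups are expected to run.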

In summary: the pgbackrest config might actually be fine, but something causes the backups to start failing at some point. There is also no alerting in place when that happens! We could detect it by checking the "Conditions" column of the Kubernetes Jobs in the postgres-operator namespace.
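As a starting point for that alert, here is a minimal sketch that lists Jobs whose conditions include `Failed`. It assumes `kubectl` is on the PATH with access to the cluster, and that the backup Jobs live in the `postgres-operator` namespace as described above. The function names are made up for illustration, and the actual alert delivery (email, chat webhook, etc.) is left out.

```python
import json
import subprocess
from typing import List


def failed_jobs(job_list: dict) -> List[str]:
    """Return names of Jobs that have a condition of type "Failed" with status "True".

    `job_list` is the parsed JSON of `kubectl get jobs -o json` (a List object).
    """
    failed = []
    for job in job_list.get("items", []):
        conditions = job.get("status", {}).get("conditions") or []
        if any(
            c.get("type") == "Failed" and c.get("status") == "True"
            for c in conditions
        ):
            failed.append(job["metadata"]["name"])
    return failed


def fetch_jobs(namespace: str = "postgres-operator") -> dict:
    """Fetch all Jobs in the namespace via kubectl and parse the JSON output."""
    result = subprocess.run(
        ["kubectl", "get", "jobs", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def main() -> None:
    """Print any failed Jobs; a real alert hook would go here instead."""
    names = failed_jobs(fetch_jobs())
    if names:
        print("Failed jobs: " + ", ".join(names))
    else:
        print("No failed jobs found.")
```

Calling `main()` on a schedule (e.g. from a CronJob) would at least surface the failure before the PVC fills up. Note that this only sees Jobs that still exist, so it depends on failed Jobs not being cleaned up before the check runs.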

Venryx, Jul 25 '24 08:07