postgres-operator
Turning backups off fills up disk with WAL logs until crash
Overview
If I do not specify a backup repo (e.g. because I don't want backups), then the WAL will grow until storage is exhausted.
Eventually the database crashes with "No space left on device" and doesn't restart, logging "FATAL: could not write lock file "postmaster.pid": No space left on device"
Environment
Please provide the following details:
- Platform: EKS
- Platform Version: 5.0.0
- PGO Image Tag: ubi8-5.0.0-0
- Postgres Version: 13
- Storage: ??
Steps to Reproduce
REPRO
Provide steps to get to the error condition:
- create cluster:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.3-4.7.0
  postgresVersion: 13
  monitoring:
    pgmonitor:
      exporter:
        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.0.0-0
  instances:
    - name: db
      replicas: 1
      dataVolumeClaimSpec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-0
- perform 1GB or so of writes
- observe DB is down due to out-of-space
EXPECTED
WAL would not be retained if there is no backup repo.
ACTUAL
WAL logs fill up disk
Additional Information
I think this could be fixed by setting archive_command to /bin/true when there are no backup repos?
Alternatively, you could just make it possible for me to override archive_mode or archive_command myself. Currently they are forced to be on:
https://github.com/CrunchyData/postgres-operator/blob/21905f0c2962e22f72aec34a9cad733e31afcdf0/internal/pgbackrest/postgres.go#L39-L40
Another flag that might be useful here: "Maximum Archive Push Queue Size Option" (--archive-push-queue-max)
As docs say:
The purpose of this feature is to prevent the log volume from filling up at which point Postgres will stop completely. Better to lose the backup than have PostgreSQL go down.
@jkatz is this something you might be able to look into?
Any news about this so far? I have the same issue, which is why I had to set up pgbackrest to perform backups every 4 hours; otherwise the WAL files eat all the free space.
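A minimal sketch of what such a schedule might look like in the PostgresCluster spec. The repo name, the retention count, and the omitted storage definition are assumptions for illustration, not the commenter's actual configuration:

```yaml
spec:
  backups:
    pgbackrest:
      global:
        # Assumption: keep only the two most recent full backups, so that
        # WAL older than the oldest retained backup can be expired.
        repo1-retention-full: "2"
      repos:
        - name: repo1
          schedules:
            # Cron expression: a full backup at minute 0 of every 4th hour.
            full: "0 */4 * * *"
          # Storage definition for repo1 (volume, s3, ...) omitted here.
```

Frequent full backups combined with a low retention count bound how much WAL the repo has to keep, at the cost of extra backup I/O.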
For the first report, you need to have at least one repo defined. There was an incorrect validation check, which will be fixed in an upcoming release (#2662).
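As a rough illustration, a cluster with a single volume-backed repo might be declared along these lines. The repo name and the storage size are assumptions; adjust them to your environment:

```yaml
spec:
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-0
      repos:
        # Assumption: one repo stored on a PersistentVolumeClaim.
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: [ReadWriteOnce]
              resources:
                requests:
                  storage: 1Gi  # hypothetical size
```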
If archive mode / command is enabled, it's a PostgreSQL feature to retain WAL logs until they are successfully pushed to the archive. Pertaining to the second issue, if you are accumulating WAL, there may be a separate issue that you need to look into. I recommend looking at the PostgreSQL logs (in the /pgdata/.../log directory) to see if the archive push is failing, and if so what the reason is.
If archive mode / command is enabled, it's a PostgreSQL feature to retain WAL logs until they are successfully pushed to the archive.
As I mentioned above, archive-push-queue-max is meant to control this. I'd rather lose WAL/point-in-time-recovery than have the database down completely.
Pertaining to the second issue, if you are accumulating WAL, there may be a separate issue that you need to look into. I recommend looking at the PostgreSQL logs (in the /pgdata/.../log directory) to see if the archive push is failing, and if so what the reason is.
archive-push is failing because the WAL receiver was out of space. The WAL receiver is out of space because for some reason it's not listening to my config where I tell it to only keep max 1 byte of data.
What I really want to do here is run a postgres without backups... I can't seem to find any way to run postgres-operator without the essentially infinite space of S3.
Hi @james-callahan,
As a workaround for a dev cluster, we ran psql and executed alter system set archive_mode=off; and then restarted. This creates postgresql.auto.conf, which is applied after postgresql.conf and overrides archive_mode to off.
How many years is it going to take to implement this...?
This one can cause serious issues.
For anyone hitting this issue, the solution that worked for us was enabling async archive mode, correctly setting the spool path, and then lowering the push queue maximum:
backups:
  pgbackrest:
    global:
      archive-async: "y"
      archive-push-queue-max: "100GiB" # Change this to your desired max
      spool-path: "/pgdata/backups"
Once added, you'll see log messages about WAL files being dropped as pgbackrest loops through the archive. In the worst case, all WAL is dropped and a new full backup is needed, but this alleviates the issue of archive_mode always being set to on.
I am using the operator on an on-premises cluster and sadly I don't have infinite disk space. How do I set archive_mode=off?
Just dropping a quick update on this thread to note that we are currently taking a look at some upcoming plans and changes to CPK to allow additional flexibility around disabling backups and archiving (as requested in this issue).
Therefore, I simply wanted to note that this issue is definitely on our radar, and we will be in touch with additional information, details, etc. once available.