postgres-operator

Turning backups off fills up disk with WAL logs until crash

Open james-callahan opened this issue 3 years ago • 11 comments

Overview

If I do not specify a backup repo (e.g. because I don't want backups), then the WAL will grow until storage is exhausted.

Eventually the database crashes with "No space left on device" and doesn't restart, logging "FATAL: could not write lock file "postmaster.pid": No space left on device"

Environment

Please provide the following details:

  • Platform: EKS
  • Platform Version: 5.0.0
  • PGO Image Tag: ubi8-5.0.0-0
  • Postgres Version 13
  • Storage: ??

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

  1. create cluster:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.3-4.7.0
  postgresVersion: 13
  monitoring:
    pgmonitor:
      exporter:
        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.0.0-0
  instances:
    - name: db
      replicas: 1
      dataVolumeClaimSpec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-0
  2. perform 1GB or so of writes
  3. observe DB is down due to out-of-space
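
To watch the WAL grow while the writes are running, a query along these lines can be run from psql. This is only a sketch and not part of the original report; it assumes PostgreSQL 10 or later (for pg_ls_waldir) and a role that may call it (a superuser or a member of pg_monitor).

```sql
-- Count the WAL segment files in pg_wal and report their total size.
-- Re-run periodically; with archiving failing, both numbers keep climbing.
SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```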

EXPECTED

WAL would not be retained if there is no backup repo.

ACTUAL

WAL logs fill up disk

Additional Information

I think this could be fixed by setting archive_command to /bin/true when there are no backup repos?

Alternatively, you could just make it possible for me to override archive_mode or archive_command myself. Currently they are forced to be on: https://github.com/CrunchyData/postgres-operator/blob/21905f0c2962e22f72aec34a9cad733e31afcdf0/internal/pgbackrest/postgres.go#L39-L40

james-callahan avatar Jul 12 '21 04:07 james-callahan

Another flag that might be useful for fixing this: "Maximum Archive Push Queue Size Option" (--archive-push-queue-max)

As docs say:

The purpose of this feature is to prevent the log volume from filling up at which point Postgres will stop completely. Better to lose the backup than have PostgreSQL go down.

james-callahan avatar Jul 12 '21 07:07 james-callahan

@jkatz is this something you might be able to look into?

james-callahan avatar Aug 12 '21 07:08 james-callahan

Any news about this so far? I have the same issue, which is why I had to set up pgbackrest to perform backups every 4 hours; otherwise the WAL files eat all the free space.

Bluesboy avatar Sep 18 '21 18:09 Bluesboy

For the first report, you need to have at least one repo defined. There was a validation check that was incorrect that will be fixed in an upcoming release (#2662).

If archive mode / command is enabled, it's a PostgreSQL feature to retain WAL logs until they are successfully pushed to the archive. Pertaining to the second issue, if you are accumulating WAL, there may be a separate issue that you need to look into. I recommend looking at the PostgreSQL logs (in the /pgdata/.../log directory) to see if the archive push is failing, and if so what the reason is.
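
Besides the log files, the archiver statistics view gives a quick indication of whether pushes are failing. The query below is a minimal sketch that relies only on the standard pg_stat_archiver view; it shows that failures are happening, not why, so the logs are still needed for the reason.

```sql
-- A rising failed_count, or a last_failed_time newer than
-- last_archived_time, suggests archive_command is currently failing
-- and WAL is being retained.
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
```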

jkatz avatar Sep 19 '21 21:09 jkatz

> If archive mode / command is enabled, it's a PostgreSQL feature to retain WAL logs until they are successfully pushed to the archive.

As I mentioned above, archive-push-queue-max is meant to control this. I'd rather lose WAL/point-in-time-recovery than have the database down completely.

> Pertaining to the second issue, if you are accumulating WAL, there may be a separate issue that you need to look into. I recommend looking at the PostgreSQL logs (in the /pgdata/.../log directory) to see if the archive push is failing, and if so what the reason is.

archive-push is failing because the WAL receiver was out of space. The WAL receiver is out of space because for some reason it's not listening to my config where I tell it to only keep max 1 byte of data.


What I really want to do here is run a postgres without backups... I can't seem to find any way to run postgres-operator without the essentially infinite space of S3.

james-callahan avatar Sep 20 '21 00:09 james-callahan

Hi @james-callahan, as a workaround for a dev cluster we connected with psql, ran alter system set archive_mode=off;, and then restarted. This creates postgresql.auto.conf, which is applied after postgresql.conf and overrides archive_mode to off.
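
Spelled out, the workaround looks roughly like the sketch below. It assumes a superuser session on the primary and that the override in postgresql.auto.conf is left in place after the restart. Note that ALTER SYSTEM cannot be run inside a transaction block.

```sql
-- Write archive_mode = off to postgresql.auto.conf.
ALTER SYSTEM SET archive_mode = off;

-- archive_mode only takes effect at server start; until the restart,
-- pending_restart should show true for this setting.
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name = 'archive_mode';

-- After restarting Postgres, confirm the new value:
SHOW archive_mode;

-- To undo the override later (also requires a restart):
-- ALTER SYSTEM RESET archive_mode;
```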

ghost avatar Jan 26 '22 10:01 ghost

How many years is it going to take to implement this...?

laurivosandi avatar Feb 26 '23 08:02 laurivosandi

This one can cause serious issues.

schrepfler avatar Jun 29 '23 12:06 schrepfler

For anyone hitting this issue, the solution that worked for us was enabling async archive mode, correctly setting the spool path, and then lowering the push queue:

      backups:
        pgbackrest:
          global:
            archive-async: "y"
            archive-push-queue-max: "100GiB" # Change this to your desired max
            spool-path: "/pgdata/backups"

Once added, you'll see logs of postgres dropping WAL files and pgbackrest looping through the archive. In the worst case, all WAL will be dropped and a full backup will be needed, but this alleviates the issue of archive_mode always being set to on.

jleeh avatar Sep 11 '23 11:09 jleeh

I am using the operator on an on-premises cluster and sadly I don't have infinite disk space. How do I set archive_mode=off?

RobKenis avatar Feb 02 '24 08:02 RobKenis

Just dropping a quick update on this thread to note that we are currently taking a look at some upcoming plans and changes to CPK to allow additional flexibility around disabling backups and archiving (as requested in this issue).

Therefore, I simply wanted to note that this issue is definitely on our radar, and we will be in touch with additional information, details, etc. once available.

andrewlecuyer avatar Jul 09 '24 17:07 andrewlecuyer