
/pgdata/pgbackrest/log/db-archive-get-async.log unbounded growth on Standby Cluster

Open wmuldergov opened this issue 9 months ago • 0 comments

Overview

If you set up a CrunchyDB cluster using an external S3 repository, the Standby Cluster writes to /pgdata/pgbackrest/log/db-archive-get-async.log every time it fetches WAL from S3, which appears to happen every 5-10 seconds. Since log rotation doesn't seem to be enabled by default for this directory (as it is for /pgdata/pg17/log), the log keeps growing and will eventually exhaust the space on the pgdata PVC.
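To put the growth rate in perspective, here is a rough back-of-envelope estimate. Only the 5-10 second interval comes from the observation above; the ~2 KB written per run and the 10 GiB of free space are hypothetical inputs, so substitute values measured on your own standby.

```python
# Back-of-envelope estimate of how fast db-archive-get-async.log grows.
# All inputs below are assumptions, not measurements from this issue.

SECONDS_PER_DAY = 86_400


def daily_growth_bytes(bytes_per_run, interval_seconds):
    """Bytes appended per day if one archive-get:async run happens every interval_seconds."""
    runs_per_day = SECONDS_PER_DAY / interval_seconds
    return bytes_per_run * runs_per_day


def days_until_full(free_bytes, bytes_per_run, interval_seconds):
    """Days until the log alone consumes free_bytes on the pgdata PVC."""
    return free_bytes / daily_growth_bytes(bytes_per_run, interval_seconds)


if __name__ == "__main__":
    # Hypothetical: ~2 KB per run (the sample block in the Logs section is
    # dominated by one long command-line entry) at the fastest observed interval (5 s).
    growth = daily_growth_bytes(bytes_per_run=2_000, interval_seconds=5)
    print(f"~{growth / 2**20:.1f} MiB/day")
    # Hypothetical 10 GiB of free space on the pgdata PVC.
    print(f"~{days_until_full(10 * 2**30, 2_000, 5):.0f} days until full")
```

With these inputs it prints `~33.0 MiB/day` and `~311 days until full`; larger log entries, a shorter interval, or a smaller PVC shorten that time proportionally.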

Environment

Please provide the following details:

  • Platform: OpenShift
  • Platform Version: 4.16
  • PGO Image Tag: ubi8-17.0-3.4-0
  • Postgres Version: 17
  • Storage: S3

Steps to Reproduce

REPRO

  1. Set up a Primary and a Standby Cluster for Crunchy using the external S3 repository method.
  2. On the Standby Cluster, monitor the size of /pgdata/pgbackrest/log/db-archive-get-async.log and watch it grow.
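Step 2 can be scripted. This minimal sketch polls the file size a few times and reports whether it ever shrank (a shrink would indicate rotation or truncation). The path comes from the issue; the sample count, the 10-second interval, and the assumption that a Python 3 interpreter is available wherever the file is readable are illustrative choices.

```python
# Poll the size of the pgBackRest async log to confirm it only ever grows.
import os
import time
from typing import List

LOG_PATH = "/pgdata/pgbackrest/log/db-archive-get-async.log"


def sample_sizes(path: str, samples: int, interval_seconds: float) -> List[int]:
    """Return the file size in bytes at each of `samples` polls."""
    sizes = []
    for i in range(samples):
        sizes.append(os.path.getsize(path))
        if i < samples - 1:  # no need to sleep after the last poll
            time.sleep(interval_seconds)
    return sizes


def is_monotonic_growth(sizes: List[int]) -> bool:
    """True if the size never shrank between polls (i.e. no rotation or truncation seen)."""
    return all(later >= earlier for earlier, later in zip(sizes, sizes[1:]))


# Only runs where the log actually exists (e.g. inside the standby's database container).
if __name__ == "__main__" and os.path.exists(LOG_PATH):
    sizes = sample_sizes(LOG_PATH, samples=6, interval_seconds=10)
    print(sizes)
    print("grew by", sizes[-1] - sizes[0], "bytes; never shrank:", is_monotonic_growth(sizes))
```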

EXPECTED

This log file is rotated like the logs in /pgdata/pg17/log.
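For comparison with the expected behavior, here is a minimal size-based rotation sketch, roughly what logrotate's `size` and `rotate` directives do. It is not how PGO implements rotation; the threshold, the number of kept files, and the assumption that each archive-get:async run reopens the log (so renaming is safe) are illustrative.

```python
# Minimal size-based rotation sketch: path -> path.1 -> ... -> path.<keep>.
# Assumes each archive-get:async run is a new pgBackRest process that reopens
# the log in append mode, so renaming the file out from under it is safe.
import os


def rotate_if_needed(path: str, max_bytes: int, keep: int) -> bool:
    """Rotate `path` once it exceeds max_bytes, keeping at most `keep` old copies.

    Returns True if a rotation happened. The oldest copy beyond `keep` is deleted.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    if not os.path.exists(path) or os.path.getsize(path) <= max_bytes:
        return False
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)
    # Shift path.(n-1) -> path.n starting from the oldest, so nothing is overwritten.
    for n in range(keep - 1, 0, -1):
        src = f"{path}.{n}"
        if os.path.exists(src):
            os.rename(src, f"{path}.{n + 1}")
    os.rename(path, f"{path}.1")
    return True
```

Running it periodically (for example from a sidecar or cron-like loop) would cap the directory at roughly `(keep + 1) * max_bytes`, plus whatever one interval adds.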

ACTUAL

The log file keeps growing until the space on the PVC is exhausted.

Logs

-------------------PROCESS START-------------------
2025-03-20 19:02:14.552 P00   INFO: archive-get:async command begin 2.53.1: [00000003000000C500000091, 00000003000000C500000092, 00000003000000C500000093, 00000003000000C500000094, 00000003000000C500000095, 00000003000000C500000096, 00000003000000C500000097, 00000003000000C500000098] --archive-async --exec-id=675601-15d23e31 --log-level-console=off --log-level-stderr=off --log-path=/pgdata/pgbackrest/log --pg1-path=/pgdata/pg17 --repo=2 --repo1-host=<REDACTED> --repo1-host-ca-file=/etc/pgbackrest/conf.d/~postgres-operator/tls-ca.crt --repo1-host-cert-file=/etc/pgbackrest/conf.d/~postgres-operator/client-tls.crt --repo1-host-key-file=/etc/pgbackrest/conf.d/~postgres-operator/client-tls.key --repo1-host-type=tls --repo1-host-user=postgres --repo1-path=/pgbackrest/repo1 --repo2-path=/db/dbbackup --repo2-s3-bucket=<REDACTED> --repo2-s3-endpoint=<REDACTED> --repo2-s3-key=<redacted> --repo2-s3-key-secret=<redacted> --repo2-s3-region=ca-central-1 --repo2-s3-uri-style=path --repo2-type=s3 --spool-path=/pgdata/pgbackrest-spool --stanza=db
2025-03-20 19:02:14.552 P00   INFO: get 8 WAL file(s) from archive: 00000003000000C500000091...00000003000000C500000098
2025-03-20 19:02:14.623 P00   INFO: archive-get:async command end: completed successfully (71ms)
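The 5-10 second interval can be checked from the log itself. This sketch extracts the `archive-get:async command begin` timestamps (line format taken from the sample above) and averages the gaps between consecutive runs; `run_start_times` and `mean_interval_seconds` are hypothetical helpers, not part of pgBackRest.

```python
# Estimate how often the standby runs archive-get:async by parsing the
# "command begin" timestamps pgBackRest writes to the log.
import re
from datetime import datetime

BEGIN_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) P\d+\s+INFO: archive-get:async command begin"
)


def run_start_times(lines):
    """Return the start time of every archive-get:async run found in `lines`."""
    times = []
    for line in lines:
        m = BEGIN_RE.match(line)
        if m:
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return times


def mean_interval_seconds(times):
    """Average gap between consecutive runs, or None if fewer than two runs."""
    if len(times) < 2:
        return None
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)
```

For example, `with open("/pgdata/pgbackrest/log/db-archive-get-async.log") as f: print(mean_interval_seconds(run_start_times(f)))` prints the average gap in seconds, or `None` if fewer than two runs were logged.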

Additional Information

I see that https://github.com/CrunchyData/postgres-operator/pull/4108 was merged recently; however, it looks like the logrotate functionality was put behind the OpenTelemetry feature flag. Because this issue affects anyone running a cluster in standby mode, I would suggest not gating it behind the OpenTelemetry feature flag.

wmuldergov, Mar 20 '25 21:03