Backups getting stuck in walArchivingFailing phase
I have deployed the CNPG operator and cluster successfully on my EKS cluster, but the scheduled backups are getting stuck in the walArchivingFailing phase. Below is my Helm values.yaml for the cluster:
```yaml
# -- Override the name of the chart
nameOverride: ""
# -- Override the full name of the chart
fullnameOverride: ""

###
# -- Type of the CNPG database. Available types:
# * `postgresql`
# * `postgis`
type: postgresql

###
# -- Cluster mode of operation. Available modes:
# * `standalone` - default mode. Creates new or updates an existing CNPG cluster.
# * `replica` - Creates a replica cluster from an existing CNPG cluster. # TODO
# * `recovery` - Same as standalone but creates a cluster from a backup, object store or via pg_basebackup.
mode: standalone

recovery:
  ##
  # -- Available recovery methods:
  # * `backup` - Recovers a CNPG cluster from a CNPG backup (PITR supported). Needs to be on the same cluster in the same namespace.
  # * `object_store` - Recovers a CNPG cluster from a barman object store (PITR supported).
  # * `pg_basebackup` - Recovers a CNPG cluster via the streaming replication protocol. Useful if you want to
  #   migrate databases to CloudNativePG, even from outside Kubernetes. # TODO
  method: backup
  ## -- Point in time recovery target. Specify one of the following:
  pitrTarget:
    # -- Time in RFC3339 format
    time: ""
  ##
  # -- Backup Recovery Method
  backupName: "" # Name of the backup to recover from. Required if method is `backup`.
  ##
  # -- The original cluster name when used in backups. Also known as serverName.
  clusterName: ""
  # -- Overrides the provider specific default endpoint. Defaults to:
  # S3: https://s3.<region>.amazonaws.com
  # Leave empty if using the default S3 endpoint
  endpointURL: ""
  # -- Specifies a CA bundle to validate a privately signed certificate.
  endpointCA:
    # -- Creates a secret with the given value if true, otherwise uses an existing secret.
    create: false
    name: ""
    key: ""
    value: ""
  # -- Overrides the provider specific default path. Defaults to:
  # S3: s3://<bucket><path>
  # Azure: https://<storageAccount>.<serviceName>.core.windows.net/<containerName><path>
  # Google: gs://<bucket><path>
  destinationPath: ""
  # -- One of `s3`, `azure` or `google`
  provider: s3
  s3:
    region: ""
    bucket: ""
    path: "/"
    accessKey: ""
    secretKey: ""
  azure:
    path: "/"
    connectionString: ""
    storageAccount: ""
    storageKey: ""
    storageSasToken: ""
    containerName: ""
    serviceName: blob
    inheritFromAzureAD: false
  google:
    path: "/"
    bucket: ""
    gkeEnvironment: false
    applicationCredentials: ""

cluster:
  # -- Number of instances
  instances: 2
  # -- Name of the container image, supporting both tags (<image>:<tag>) and digests for deterministic and repeatable deployments:
  # <image>:<tag>@sha256:<digestValue>
  imageName: "" # Default value depends on type (postgresql/postgis/timescaledb)
  # -- Image pull policy. One of Always, Never or IfNotPresent. If not defined, it defaults to IfNotPresent. Cannot be updated.
  # More info: https://kubernetes.io/docs/concepts/containers/images#updating-images
  imagePullPolicy: Always
  # -- The list of pull secrets to be used to pull the images.
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-LocalObjectReference
  imagePullSecrets: []
  storage:
    size: 20Gi
    storageClass: "gp3"
  # -- The UID of the postgres user inside the image, defaults to 26
  postgresUID: 26
  # -- The GID of the postgres user inside the image, defaults to 26
  postgresGID: 26
  # -- Resources requirements of every generated Pod.
  # Please refer to https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ for more information.
  # We strongly advise you use the same setting for limits and requests so that your cluster pods are given a Guaranteed QoS.
  # See: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
  resources:
    limits:
      cpu: 2000m
      memory: 8Gi
    requests:
      cpu: 200m
      memory: 1Gi
  priorityClassName: ""
  # -- Method to follow to upgrade the primary server during a rolling update procedure, after all replicas have been
  # successfully updated. It can be switchover (default) or in-place (restart).
  primaryUpdateMethod: switchover
  # -- Strategy to follow to upgrade the primary server during a rolling update procedure, after all replicas have been
  # successfully updated: it can be automated (unsupervised - default) or manual (supervised)
  primaryUpdateStrategy: unsupervised
  # -- The instances' log level, one of the following values: error, warning, info (default), debug, trace
  logLevel: "debug"
  # -- Affinity/Anti-affinity rules for Pods.
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-AffinityConfiguration
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: Deployment
                operator: In
                values:
                  - copilot
    # topologyKey: topology.kubernetes.io/zone
  # -- The configuration for the CA and related certificates.
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-CertificatesConfiguration
  certificates:
  # -- When this option is enabled, the operator will use the SuperuserSecret to update the postgres user password.
  # If the secret is not present, the operator will automatically create one.
  # When this option is disabled, the operator will ignore the SuperuserSecret content, delete it when automatically created,
  # and then blank the password of the postgres user by setting it to NULL.
  enableSuperuserAccess: true
  superuserSecret: ""
  monitoring:
    # -- Whether to enable monitoring
    enabled: true
    podMonitor:
      # -- Whether to enable the PodMonitor
      enabled: true
    prometheusRule:
      # -- Whether to enable the PrometheusRule automated alerts
      enabled: true
      # -- Exclude specified rules
      excludeRules: []
      # - CNPGClusterZoneSpreadWarning
    # -- Custom Prometheus metrics
    customQueries:
      - name: "pg_cache_hit_ratio"
        query: "SELECT current_database() as datname, sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio FROM pg_statio_user_tables;"
        metrics:
          - datname:
              usage: "LABEL"
              description: "Name of the database"
          - ratio:
              usage: GAUGE
              description: "Cache hit ratio"
  # -- Configuration of the PostgreSQL server.
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration
  postgresql: {}
  # max_connections: 300
  # -- BootstrapInitDB is the configuration of the bootstrap process when initdb is used.
  # See: https://cloudnative-pg.io/documentation/current/bootstrap/
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-bootstrapinitdb
  initdb: {}
  # database: app
  # owner: "" # Defaults to the database name
  # secret: "" # Name of the secret containing the initial credentials for the owner of the user database. If empty a new secret will be created from scratch
  # postInitSQL:
  #   - CREATE EXTENSION IF NOT EXISTS vector;
  additionalLabels: {}
  annotations:
    cnpg.io/skipWalArchiving: "enabled"

backups:
  # -- You need to configure backups manually, so backups are disabled by default.
  enabled: true
  # -- Overrides the provider specific default endpoint. Defaults to:
  # S3: https://s3.<region>.amazonaws.com
  endpointURL: "" # Leave empty if using the default S3 endpoint
  # -- Specifies a CA bundle to validate a privately signed certificate.
  endpointCA:
    # -- Creates a secret with the given value if true, otherwise uses an existing secret.
    create: false
    name: ""
    key: ""
    value: ""
  # -- Overrides the provider specific default path. Defaults to:
  # S3: s3://<bucket><path>
  # Azure: https://<storageAccount>.<serviceName>.core.windows.net/<containerName><path>
  # Google: gs://<bucket><path>
  destinationPath: ""
  # -- One of `s3`, `azure` or `google`
  provider: s3
  s3:
    region: "us-east-1"
    bucket: "ds-mongodb-backup"
    path: "/postgresbackup/"
    accessKey: "AKIAZMOEICVTYDS2R5RXHB"
    secretKey: "eWKStVPqqI2nzq5uYCJKDWlOTxQT2RjlsWFsnx2CYR"
  azure:
    path: "/"
    connectionString: ""
    storageAccount: ""
    storageKey: ""
    storageSasToken: ""
    containerName: ""
    serviceName: blob
    inheritFromAzureAD: false
  google:
    path: "/"
    bucket: ""
    gkeEnvironment: false
    applicationCredentials: ""
  # wal:
  #   # -- WAL compression method. One of `` (for no compression), `gzip`, `bzip2` or `snappy`.
  #   compression: gzip
  #   # -- Whether to instruct the storage provider to encrypt WAL files. One of `` (use the storage container default), `AES256` or `aws:kms`.
  #   encryption: AES256
  #   # -- Number of WAL files to be archived or restored in parallel.
  #   maxParallel: 1
  data:
    # -- Data compression method. One of `` (for no compression), `gzip`, `bzip2` or `snappy`.
    compression: gzip
    # -- Whether to instruct the storage provider to encrypt data files. One of `` (use the storage container default), `AES256` or `aws:kms`.
    encryption: AES256
    # -- Number of data files to be archived or restored in parallel.
    jobs: 2
  scheduledBackups:
    - name: daily-backup
      schedule: "0 0 10 * * *"
      backupOwnerReference: self
  # -- Retention policy for backups
  retentionPolicy: "15d"

pooler:
  # -- Whether to enable PgBouncer
  enabled: true
  # -- PgBouncer pooling mode
  poolMode: transaction
  # -- Number of PgBouncer instances
  instances: 1
  # -- PgBouncer configuration parameters
  parameters:
    max_client_conn: "1000"
    default_pool_size: "25"
  monitoring:
    # -- Whether to enable monitoring
    enabled: true
    podMonitor:
      # -- Whether to enable the PodMonitor
      enabled: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: Deployment
                operator: In
                values:
                  - copilot
  # -- Custom PgBouncer deployment template.
  # Use to override image, specify resources, etc.
  template: {}
```
Could anyone please help me figure out why it is getting stuck and not working properly? Also, can anyone point me to where I can find the backup logs?
The backup logs are in the same pod as the main Postgres instance, so you should be able to find the cause of the failure there. I had the same problem, and in my particular case it was because the S3 bucket already had some data in it from a previous test. The error looked like this:
```
ERROR: WAL archive check failed for server hippo-cluster: Expected empty archive
```
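If that turns out to be your case too, the fix on my side was to empty that path in the bucket (or point the cluster at a fresh path) before re-creating it. Roughly, using the bucket and path from your values above (the `cluster-name` segment is a placeholder for whatever serverName your cluster uses):

```shell
# List what barman already stored under the backup path
aws s3 ls --recursive s3://ds-mongodb-backup/postgresbackup/

# Remove the leftovers for the old server so the WAL archive check starts clean
# (cluster-name is a placeholder -- substitute your cluster's serverName)
aws s3 rm --recursive s3://ds-mongodb-backup/postgresbackup/cluster-name/
```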
I'm closing this for now. If you are still having the issue, could you please attach some logs from the primary instance?
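Something like this should capture the relevant entries (pod name and namespace are placeholders, adjust to your setup; CNPG instance pods are named `<cluster>-<n>`):

```shell
# Fetch logs from the primary instance pod and filter for WAL archiving entries
kubectl logs cluster-name-1 -n default | grep -i wal-archive
```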