WAL Volume Usage Leaks
Describe the bug
WAL volume usage starts to leak after some arbitrary period of time, despite pgbackrest being configured properly, backups to S3 continuing to succeed, and WAL pruning having worked correctly for at least the prior 12 hours.
To Reproduce 🤷
Expected behavior
WAL archival works consistently.
Deployment
- What is in your `values.yaml`?
We manage infrastructure with Pulumi, which lets us manage Helm charts with TypeScript, so this will look like a basic JavaScript object instead of raw YAML.
name: `timescale-${args.name}`,
patroni: {
bootstrap: {
dcs: {
postgresql: {
parameters: {
'archive_command': "/etc/timescaledb/scripts/pgbackrest_archive.sh %p",
'archive_mode': 'on',
'archive_timeout': '180',
'autovacuum_analyze_scale_factor': '0.02',
'autovacuum_naptime': '5s',
'autovacuum_max_workers': '10',
'autovacuum_vacuum_cost_limit': '500',
'autovacuum_vacuum_scale_factor': '0.05',
'log_autovacuum_min_duration': '1min',
'hot_standby': 'on',
'log_checkpoints': 'on',
'log_connections': 'on',
'log_disconnections': 'on',
'log_line_prefix': "%t [%p]: [%c-%l] %u@%d,app=%a [%e] ",
'log_lock_waits': 'on',
'log_min_duration_statement': '1s',
'log_statement': 'ddl',
'max_wal_size': '4GB',
'max_connections': '100',
'max_prepared_transactions': '150',
'shared_preload_libraries': 'timescaledb,pg_stat_statements',
'ssl': 'on',
'ssl_cert_file': '/etc/certificate/tls.crt',
'ssl_key_file': '/etc/certificate/tls.key',
'tcp_keepalives_idle': '900',
'tcp_keepalives_interval': '100',
'temp_file_limit': '1GB',
'timescaledb.passfile': '../.pgpass',
'unix_socket_directories': '/var/run/postgresql',
'unix_socket_permissions': '0750',
'wal_level': 'hot_standby',
'wal_log_hints': 'on',
'use_pg_rewind': 'true',
'use_slots': 'true'
}
}
}
}
},
backup: {
enabled: true,
pgBackRest: {
'archive-async': 'y',
'archive-timeout': '180',
'compress-type': 'lz4',
'process-max': '4',
'start-fast': 'y',
'repo1-retention-diff': '2',
'repo1-retention-full': '2',
'repo1-type': 's3',
'repo1-cipher-type': 'none',
'repo1-s3-endpoint': 's3.amazonaws.com'
},
env: [
{
name: 'PGBACKREST_REPO1_S3_BUCKET',
value: redacted
},
{
name: 'PGBACKREST_REPO1_S3_REGION',
value: redacted
},
{
name: 'PGBACKREST_REPO1_S3_KEY',
value: redacted
},
{
name: 'PGBACKREST_REPO1_S3_KEY_SECRET',
value: redacted
}
],
jobs: [
{
name: 'full',
type: 'full',
// nightly @ 12am
schedule: '0 0 * * *'
},
{
name: 'incremental',
type: 'incr',
// every 6 hours except midnight
schedule: '0 6,12,18 * * *'
}
]
},
persistentVolumes: {
data: {
size: args.storageSize ? `${args.storageSize}Gi` : `20Gi`
},
wal: {
size: args.storageSizeWAL ? `${args.storageSizeWAL}Gi` : '500Gi'
}
},
pgBouncer: {
enabled: true,
config: {
server_reset_query: 'DISCARD ALL',
max_client_conn: '500',
default_pool_size: '12',
pool_mode: 'transaction',
// https://github.com/timescale/timescaledb-kubernetes/pull/391
auth_file: '/etc/pgbouncer/userlist/userlist.txt/userlist.txt'
},
pg_hba: [
'local all postgres peer',
'host all postgres,standby 0.0.0.0/0 password',
'host all postgres,standby ::0/0 password',
'hostssl all all 0.0.0.0/0 password',
'hostssl all all ::0/0 password',
'hostnossl all all 0.0.0.0/0 password',
'hostnossl all all ::0/0 password'
],
userListSecretName: `timescale-${args.name}-credentials`
},
prometheus: {
enabled: true
},
resources: args.resources ?? undefined,
/**
* Secret must contain env vars that configure Patroni per:
* https://patroni.readthedocs.io/en/latest/ENVIRONMENT.html
*/
secrets: {
credentialsSecretName: `timescale-${args.name}-credentials`,
certificateSecretName: certificate.metadata.apply(cert => cert.name),
},
podAnnotations: {
'prometheus.io/scrape': 'true',
'prometheus.io/port': '9187'
},
// postgres version, this will not be reflected in the version of the image being run
// https://github.com/timescale/timescaledb-kubernetes/blob/a12dd47a2339ce1bbacde728f3eeb94309ce0e6f/charts/timescaledb-single/values.schema.yaml#L533
version: args.pgversion ?? 13
},
transformations: [
(manifest: any) => {
// jobs created by this helm chart will hang forever if the istio sidecar is injected
if (manifest.kind === 'Job') {
manifest.spec.template.metadata.annotations = { 'sidecar.istio.io/inject': 'false' }
}
if (manifest.kind === 'CronJob') {
manifest.spec.jobTemplate.spec.template.metadata.annotations = { 'sidecar.istio.io/inject': 'false' }
}
}
]
- What version of the Chart are you using? 0.13.1
- What is your Kubernetes Environment (for example: GKE, EKS, minikube, microk8s)? EKS 1.21.
Deployment
NAME READY STATUS RESTARTS AGE ROLE
pod/timescale-redacted-timescaledb-0 5/5 Running 0 23m master
pod/timescale-redacted-timescaledb-1 5/5 Running 0 34m replica
pod/timescale-redacted-timescaledb-2 5/5 Running 0 35m replica
pod/timescale-redacted-timescaledb-full-27645120-qxt5b 0/1 Completed 0 43h
pod/timescale-redacted-timescaledb-full-27646560-w4xfp 0/1 Completed 0 19h
pod/timescale-redacted-timescaledb-incremental-27646920-8fhjh 0/1 Completed 0 13h
pod/timescale-redacted-timescaledb-incremental-27647280-6pvgz 0/1 Completed 0 7h48m
pod/timescale-redacted-timescaledb-incremental-27647640-gpq8v 0/1 Completed 0 108m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE ROLE
service/timescale-redacted LoadBalancer 172.20.24.175 redacted.elb.amazonaws.com 6432:32495/TCP,5432:32670/TCP 5d22h master
service/timescale-redacted-config ClusterIP None <none> 8008/TCP 5d22h
service/timescale-redacted-prometheus ClusterIP 172.20.122.84 <none> 9187/TCP 5d22h
service/timescale-redacted-replica ClusterIP 172.20.223.252 <none> 6432/TCP,5432/TCP 5d22h replica
service/timescale-redacted-timescaledb-backup ClusterIP 172.20.198.150 <none> 8081/TCP 5d22h
NAME READY AGE ROLE
statefulset.apps/timescale-redacted-timescaledb 3/3 5d22h
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE ROLE
cronjob.batch/timescale-redacted-timescaledb-full 0 0 * * * False 0 19h 5d22h
cronjob.batch/timescale-redacted-timescaledb-incremental 0 6,12,18 * * * False 0 108m 5d22h
NAME COMPLETIONS DURATION AGE ROLE
job.batch/timescale-redacted-patroni-ua 1/1 2s 4h4m
job.batch/timescale-redacted-timescaledb-full-27643680 1/1 9s 2d19h
job.batch/timescale-redacted-timescaledb-full-27645120 1/1 4s 43h
job.batch/timescale-redacted-timescaledb-full-27646560 1/1 6s 19h
job.batch/timescale-redacted-timescaledb-increment-manual-8mc 1/1 3s 4d22h
job.batch/timescale-redacted-timescaledb-incremental-27639720 0/1 5d13h 5d13h
job.batch/timescale-redacted-timescaledb-incremental-27646920 1/1 3s 13h
job.batch/timescale-redacted-timescaledb-incremental-27647280 1/1 4s 7h48m
job.batch/timescale-redacted-timescaledb-incremental-27647640 1/1 4s 108m
NAME TYPE DATA AGE ROLE
secret/timescale-redacted-certificate kubernetes.io/tls 2 5d22h
secret/timescale-redacted-credentials Opaque 4 4d2h
secret/timescale-redacted-pgbackrest Opaque 7 5d22h
secret/timescale-redacted-timescaledb-token-x26s6 kubernetes.io/service-account-token 3 5d22h
NAME DATA AGE ROLE
configmap/timescale-redacted-timescaledb-patroni 1 4d22h
configmap/timescale-redacted-timescaledb-pgbackrest 1 4d22h
configmap/timescale-redacted-timescaledb-pgbouncer 2 5d22h
configmap/timescale-redacted-timescaledb-scripts 8 5d22h
NAME ENDPOINTS AGE ROLE
endpoints/timescale-redacted 10.0.62.236:5432,10.0.62.236:6432 5d22h
endpoints/timescale-redacted-config 10.0.55.166:8008,10.0.62.236:8008,10.0.65.176:8008 5d22h
endpoints/timescale-redacted-prometheus 10.0.55.166:9187,10.0.62.236:9187,10.0.65.176:9187 5d22h
endpoints/timescale-redacted-replica 10.0.55.166:6432,10.0.65.176:6432,10.0.55.166:5432 5d22h replica
endpoints/timescale-redacted-timescaledb-backup 10.0.62.236:8081 5d22h
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE ROLE
persistentvolumeclaim/storage-volume-timescale-redacted-timescaledb-0 Bound pvc-979aeaef-637c-4bfe-b264-a72408253954 1000Gi RWO gp2 52d
persistentvolumeclaim/storage-volume-timescale-redacted-timescaledb-1 Bound pvc-19bd29c4-55c7-419c-9c31-e95d83598feb 1000Gi RWO gp2 52d
persistentvolumeclaim/storage-volume-timescale-redacted-timescaledb-2 Bound pvc-76dc3def-d34b-417f-9e04-75b66cdc13c8 1000Gi RWO gp2 52d
persistentvolumeclaim/wal-volume-timescale-redacted-timescaledb-0 Bound pvc-1f148b09-75db-43d0-8dd7-ccd1dc8e83b1 1000Gi RWO gp2 52d
persistentvolumeclaim/wal-volume-timescale-redacted-timescaledb-1 Bound pvc-a970b81a-54a1-44ef-92bd-256a2ef1e89e 1000Gi RWO gp2 52d
persistentvolumeclaim/wal-volume-timescale-redacted-timescaledb-2 Bound pvc-6a9d9f85-ad17-4e22-8170-7bc9d25f59ae 1000Gi RWO gp2 52d
Logs
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:55:47 UTC [4547]: [62e046c3.11c3-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:55:52 UTC [6251]: [62e046c8.186b-3] standby@[unknown],app=timescale-redacted-timescaledb-1 [58P01] ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:55:52 UTC [6250]: [62e046c8.186a-3] standby@[unknown],app=timescale-redacted-timescaledb-2 [58P01] ERROR: requested WAL segment 00000014000002150000005F has already been removed
timescale-redacted-timescaledb-2 timescaledb 2022-07-26 19:55:52 UTC [4722]: [62e046c7.1272-1] @,app= [00000] LOG: started streaming WAL from primary at 215/5F000000 on timeline 20
timescale-redacted-timescaledb-2 timescaledb 2022-07-26 19:55:52 UTC [4722]: [62e046c7.1272-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000014000002150000005F has already been removed
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:55:52 UTC [4554]: [62e046c7.11ca-1] @,app= [00000] LOG: started streaming WAL from primary at 248/A1000000 on timeline 31
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:55:52 UTC [4554]: [62e046c7.11ca-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:55:57 UTC [6261]: [62e046cd.1875-3] standby@[unknown],app=timescale-redacted-timescaledb-2 [58P01] ERROR: requested WAL segment 00000014000002150000005F has already been removed
timescale-redacted-timescaledb-2 timescaledb 2022-07-26 19:55:57 UTC [4737]: [62e046cc.1281-1] @,app= [00000] LOG: started streaming WAL from primary at 215/5F000000 on timeline 20
timescale-redacted-timescaledb-2 timescaledb 2022-07-26 19:55:57 UTC [4737]: [62e046cc.1281-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000014000002150000005F has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:55:57 UTC [6262]: [62e046cd.1876-3] standby@[unknown],app=timescale-redacted-timescaledb-1 [58P01] ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:55:57 UTC [4562]: [62e046cc.11d2-1] @,app= [00000] LOG: started streaming WAL from primary at 248/A1000000 on timeline 31
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:55:57 UTC [4562]: [62e046cc.11d2-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:56:02 UTC [6280]: [62e046d2.1888-3] standby@[unknown],app=timescale-redacted-timescaledb-2 [58P01] ERROR: requested WAL segment 00000014000002150000005F has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:56:02 UTC [6281]: [62e046d2.1889-3] standby@[unknown],app=timescale-redacted-timescaledb-1 [58P01] ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-2 timescaledb 2022-07-26 19:56:02 UTC [4744]: [62e046d1.1288-1] @,app= [00000] LOG: started streaming WAL from primary at 215/5F000000 on timeline 20
timescale-redacted-timescaledb-2 timescaledb 2022-07-26 19:56:02 UTC [4744]: [62e046d1.1288-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000014000002150000005F has already been removed
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:56:02 UTC [4569]: [62e046d1.11d9-1] @,app= [00000] LOG: started streaming WAL from primary at 248/A1000000 on timeline 31
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:56:02 UTC [4569]: [62e046d1.11d9-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:56:07 UTC [6287]: [62e046d7.188f-3] standby@[unknown],app=timescale-redacted-timescaledb-2 [58P01] ERROR: requested WAL segment 00000014000002150000005F has already been removed
timescale-redacted-timescaledb-2 timescaledb 2022-07-26 19:56:07 UTC [4752]: [62e046d6.1290-1] @,app= [00000] LOG: started streaming WAL from primary at 215/5F000000 on timeline 20
timescale-redacted-timescaledb-2 timescaledb 2022-07-26 19:56:07 UTC [4752]: [62e046d6.1290-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000014000002150000005F has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:56:07 UTC [6288]: [62e046d7.1890-3] standby@[unknown],app=timescale-redacted-timescaledb-1 [58P01] ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:56:07 UTC [4577]: [62e046d7.11e1-1] @,app= [00000] LOG: started streaming WAL from primary at 248/A1000000 on timeline 31
timescale-redacted-timescaledb-1 timescaledb 2022-07-26 19:56:07 UTC [4577]: [62e046d7.11e1-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001F00000248000000A1 has already been removed
timescale-redacted-timescaledb-0 timescaledb 2022-07-26 19:56:11 UTC [3529]: [62e04249.dc9-6] @,app= [00000] LOG: checkpoint complete: wrote 17602 buffers (3.4%); 1 WAL file(s) added, 0 removed, 0 recycled; write=269.841 s, sync=0.030 s, total=270.065 s; sync files=100, longest=0.004 s, average=0.001 s; distance=199405 kB, estimate=637217 kB
These WAL segments are not on any of the disks of any of the cluster members.
Additional context
postgres@timescale-redacted-timescaledb-0:~$ patronictl list
+------------------------------------+-------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: timescale-redacted (uninitialized) ---+---------+---------+----+-----------+
| timescale-redacted-timescaledb-0 | 10.0.62.236 | Replica | running | 51 | 0 |
| timescale-redacted-timescaledb-1 | 10.0.65.176 | Replica | running | 30 | 0 |
| timescale-redacted-timescaledb-2 | 10.0.55.166 | Replica | running | 19 | 0 |
+------------------------------------+-------------+---------+---------+----+-----------+
postgres@timescale-redacted-timescaledb-0:~$ patronictl switchover --candidate timescale-redacted-timescaledb-0
Error: This cluster has no master
postgres@timescale-redacted-timescaledb-0:~$ patronictl failover --candidate timescale-redacted-timescaledb-0
Current cluster topology
+------------------------------------+-------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: timescale-redacted (uninitialized) ---+---------+---------+----+-----------+
| timescale-redacted-timescaledb-0 | 10.0.62.236 | Replica | running | 51 | 0 |
| timescale-redacted-timescaledb-1 | 10.0.65.176 | Replica | running | 30 | 0 |
| timescale-redacted-timescaledb-2 | 10.0.55.166 | Replica | running | 19 | 0 |
+------------------------------------+-------------+---------+---------+----+-----------+
Are you sure you want to failover cluster timescale-redacted? [y/N]: y
Failover failed, details: 503, Failover failed
postgres@timescale-redacted-timescaledb-0:~$ curl http://localhost:8008
{"state": "running", "postmaster_start_time": "2022-07-26 20:08:44.132321+00:00", "role": "master", "server_version": 130007, "xlog": {"location": 4716794140944}, "timeline": 51, "dcs_last_seen": 1658866215, "database_system_identifier": "7105149903557316642", "patroni": {"version": "2.1.3", "scope": "timescale-redacted"}}
postgres=# select * from pg_replication_origin_status;
local_id | external_id | remote_lsn | local_lsn
----------+-------------+------------+-----------
(0 rows)
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
(0 rows)
Other things that have been tried:
- Restarting the StatefulSet.
- Reloading configs and restarting the cluster in `patronictl`.
- Scaling the StatefulSet down to 1 member, then restarting it both via Kube and Patroni.
- Manually forcing a `checkpoint` run in `psql`.
- Blowing it all up and rebuilding it.
The state of the cluster in Patroni and the missing WAL segments are all I have to go on at this point. The cluster otherwise operates normally and backup data shows up in S3 as expected; the WAL data just continues to leak on disk until it fills the volume to capacity and the cluster dies.
Right after I create this issue I'll be enabling `wal_compression` and setting `wal_keep_size` to see if that helps with the missing segments.
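For concreteness, a minimal sketch of how those two settings would slot into the `parameters` object in the values above; the `4GB` figure is a placeholder assumption on my part, not a recommendation from this thread:

```typescript
// Hypothetical addition to patroni.bootstrap.dcs.postgresql.parameters
// from the values object shown earlier.
const walRetentionParameters = {
  'wal_compression': 'on',  // compress full-page images written to WAL
  'wal_keep_size': '4GB',   // placeholder: minimum WAL retained for standbys (PostgreSQL 13+)
};
```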
Also odd is that it's the same two WAL segment IDs that go missing in every incarnation of the cluster.
Adding compression to the mix has dramatically slowed the rate of leakage, but it's definitely still leaking.
Hmm, strange. Does the cluster ever show as having a master? What is `recovery_command` set to on the standbys?
> Hmm, strange. Does the cluster ever show as having a master?

Depends on how you ask. Kubernetes will show the labels of one of the pods in the cluster as `role=master`, but `patronictl` always says everything is a `Replica`.
> What is `recovery_command` set to on the standbys?

Do you mean `restore_command`? It's the default: `/etc/timescaledb/scripts/pgbackrest_archive_get.sh %f "%p"`, as per https://github.com/timescale/helm-charts/blob/83897565445f8f6e27fd847c6d52e794a6b1f553/charts/timescaledb-single/values.yaml#L239
Worth noting that this is how it behaves right after I stand it up, too, before even inserting data.
Next steps for us are to implement the suggested values for a high throughput cluster and try to sidestep the replication issue entirely:
https://github.com/timescale/helm-charts/blob/master/charts/timescaledb-single/values/high_throughput.example.yaml
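As a sketch of what that would look like on our side, assuming the example file is checked out locally and `js-yaml` is installed (`baseValues` below is a stand-in for the values object earlier in this issue; the example file's actual contents aren't reproduced here):

```typescript
import * as fs from 'fs';
import * as yaml from 'js-yaml';

// Stand-in for the Pulumi values object shown earlier in this issue.
const baseValues: Record<string, unknown> = { /* ... */ };

// Helm-style deep merge: maps are merged recursively, lists and scalars
// are replaced by the override.
function deepMerge(base: any, override: any): any {
  if (Array.isArray(base) || Array.isArray(override)) return override;
  if (typeof base !== 'object' || typeof override !== 'object' || base === null || override === null) {
    return override;
  }
  const merged: any = { ...base };
  for (const key of Object.keys(override)) {
    merged[key] = key in base ? deepMerge(base[key], override[key]) : override[key];
  }
  return merged;
}

const highThroughputOverrides = yaml.load(
  fs.readFileSync('high_throughput.example.yaml', 'utf8')
) as Record<string, unknown>;

const values = deepMerge(baseValues, highThroughputOverrides);
```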
I've been banging my head against the wall (no pun intended) too because of this. Glad to see it wasn't just me. Twice my experimental installations have crashed because the WAL volume filled up.
Any workaround/fix for this issue?
We hit the same error, but were able to fix it by re-initializing the failing replica with the following procedure.
The cluster topology with the failing replica:
postgres@timescaledb-2:~$ patronictl -c /etc/timescaledb/patroni.yaml topology
+-----------------+---------------+--------------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: osp-staging-timescaledb (7184453855821848609) --+----+-----------+
| timescaledb-0 | 10.179.201.63 | Leader | running | 86 | |
| + timescaledb-1 | 10.179.192.96 | Sync Standby | running | 86 | 0 |
| + timescaledb-2 | 10.179.211.89 | Replica | running | 82 | 104795 |
+-----------------+---------------+--------------+---------+----+-----------+
Running patroni re-init for the failing replica member:
postgres@timescaledb-2:~$ patronictl -c /etc/timescaledb/patroni.yaml reinit osp-staging-timescaledb
+---------------+---------------+--------------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: osp-staging-timescaledb (7184453855821848609) +----+-----------+
| timescaledb-0 | 10.179.201.63 | Leader | running | 86 | |
| timescaledb-1 | 10.179.192.96 | Sync Standby | running | 86 | 0 |
| timescaledb-2 | 10.179.211.89 | Replica | running | 82 | 104795 |
+---------------+---------------+--------------+---------+----+-----------+
Which member do you want to reinitialize [timescaledb-0, timescaledb-1, timescaledb-2]? []: timescaledb-2
Are you sure you want to reinitialize members timescaledb-2? [y/N]: y
Success: reinitialize for member timescaledb-2
postgres@timescaledb-2:~$ ps ax | grep postgres
855 ? S 0:21 pg_basebackup --pgdata=/var/lib/postgresql/data -X stream --dbname=dbname=postgres user=standby host=10.179.201.63 port=5432 --waldir=/var/lib/postgresql/wal/pg_wal
856 ? S 0:00 pg_basebackup --pgdata=/var/lib/postgresql/data -X stream --dbname=dbname=postgres user=standby host=10.179.201.63 port=5432 --waldir=/var/lib/postgresql/wal/pg_wal
postgres@timescaledb-2:~$ patronictl -c /etc/timescaledb/patroni.yaml topology
+-----------------+---------------+--------------+------------------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: osp-staging-timescaledb (7184453855821848609) -----------+----+-----------+
| timescaledb-0 | 10.179.201.63 | Leader | running | 86 | |
| + timescaledb-1 | 10.179.192.96 | Sync Standby | running | 86 | 0 |
| + timescaledb-2 | 10.179.211.89 | Replica | creating replica | | unknown |
+-----------------+---------------+--------------+------------------+----+-----------+
After some time the re-init has completed:
postgres@timescaledb-2:~$ patronictl -c /etc/timescaledb/patroni.yaml topology
+-----------------+---------------+--------------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: osp-staging-timescaledb (7184453855821848609) --+----+-----------+
| timescaledb-0 | 10.179.201.63 | Leader | running | 86 | |
| + timescaledb-1 | 10.179.192.96 | Sync Standby | running | 86 | 0 |
| + timescaledb-2 | 10.179.211.89 | Replica | running | 86 | 0 |
+-----------------+---------------+--------------+---------+----+-----------+
postgres@timescaledb-2:~$ ps ax | grep postgres
1344 ? S 0:00 postgres -D /var/lib/postgresql/data --config-file=/var/lib/postgresql/data/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 --cluster_name=osp-staging-timescaledb --wal_level=replica --hot_standby=on --max_connections=100 --max_wal_senders=10 --max_prepared_transactions=150 --max_locks_per_transaction=64 --track_commit_timestamp=off --max_replication_slots=10 --max_worker_processes=8 --wal_log_hints=on
1346 ? Ss 0:00 postgres: osp-staging-timescaledb: startup recovering 00000056000001D500000047
1353 ? Ss 0:00 postgres: osp-staging-timescaledb: checkpointer
1354 ? Ss 0:00 postgres: osp-staging-timescaledb: background writer
1356 ? Ss 0:00 postgres: osp-staging-timescaledb: stats collector
1358 ? Ss 0:00 postgres: osp-staging-timescaledb: walreceiver streaming 1D5/47000148
1365 ? Ss 0:00 postgres: osp-staging-timescaledb: postgres postgres [local] idle
1374 ? Ss 0:00 postgres: osp-staging-timescaledb: postgres postgres [local] idle
1375 ? Ss 0:05 postgres: osp-staging-timescaledb: postgres postgres [local] idle
1383 ? Ss 0:03 postgres: osp-staging-timescaledb: datadog postgres 10.179.208.80(44642) idle
And the error `FATAL: could not receive data from WAL stream: ERROR: requested WAL segment xxxxxxxx has already been removed` has gone.
Patroni docs:
https://patroni.readthedocs.io/en/latest/rest_api.html?highlight=patronictl%20reinit#reinitialize-endpoint
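If you'd rather script this than exec into a pod, a minimal sketch of hitting that endpoint directly (assuming Node 18+ for the global `fetch`, and that the Patroni REST port 8008 is reachable, as with the `-config` service in the deployment listing above):

```typescript
// Trigger a reinit of one member via Patroni's REST API
// (POST /reinitialize on the member to be rebuilt).
async function reinitializeMember(host: string): Promise<void> {
  const res = await fetch(`http://${host}:8008/reinitialize`, { method: 'POST' });
  if (!res.ok) {
    throw new Error(`reinitialize failed: HTTP ${res.status} ${await res.text()}`);
  }
}

// e.g. the lagging replica from the topology above:
// await reinitializeMember('10.179.211.89');
```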