
spilo/patroni not able to elect a new leader if the previous leader (last working member) failed due to a full disk?

Open · joar opened this issue on Jun 27, 2017 · 0 comments

Scenario

  • GKE Kubernetes
  • spilo Pods via StatefulSet: patroni-set-0003
    kind: StatefulSet
    # [...]
    metadata:
      name: patroni-set-0003
    spec:
      replicas: 3
      # [...]
      template:
        spec:
          containers:
            - name: spilo
              # [...]
              env:
                - name: SCOPE
                  value: the-scope
              volumeMounts:
                - mountPath: /home/postgres/pgdata
                  name: pgdata
      volumeClaimTemplates:
        - metadata:
            name: pgdata
          spec:
            # [...]
            resources:
              requests:
                storage: 500Gi

Unfortunately, /home/postgres/pgdata ran out of space (in all pods, it seems, probably almost simultaneously) and spilo/patroni started logging:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/patroni/async_executor.py", line 39, in run
    wakeup = func(*args) if args else func()
  File "/usr/local/lib/python3.5/dist-packages/patroni/postgresql.py", line 1067, in _do_follow
    self.write_recovery_conf(primary_conninfo)
  File "/usr/local/lib/python3.5/dist-packages/patroni/postgresql.py", line 911, in write_recovery_conf
    f.write("{0} = '{1}'\n".format(name, value))
OSError: [Errno 28] No space left on device
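
In case it helps anyone reproducing this, a quick way to confirm which members are actually out of space (a sketch, assuming kubectl is pointed at the right namespace and the pod names from the StatefulSet above):

    # check free space on the pgdata volume of each member
    for i in 0 1 2; do
      kubectl exec patroni-set-0003-$i -- df -h /home/postgres/pgdata
    done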

I believe the last leader before all pods ran out of disk was either patroni-set-0003-1 or patroni-set-0003-2.

Recovery

In order to solve the issue, I:

  1. Scaled down patroni-set-0003 to 1 replica (still failing with OSError: No space left on device). Note that this left me without any running old leader, broken or not; I believe this could be a key to my issue. (Rough commands are sketched after this list.)
  2. Created a new StatefulSet, patroni-set-0004, with the same configuration as patroni-set-0003 except:
    metadata.name: patroni-set-0004
    spec.replicas: 1
    spec.volumeClaimTemplates[0].spec.resources.requests.storage: 1Ti
    
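For reference, steps 1 and 2 roughly correspond to the following (a sketch; patroni-set-0004.yaml is a hypothetical file name for the edited manifest):

    # step 1: scale the broken StatefulSet down to a single replica
    kubectl scale statefulset patroni-set-0003 --replicas=1

    # step 2: create the new StatefulSet with the larger volume claim
    kubectl create -f patroni-set-0004.yaml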

With only the broken patroni-set-0003-0 running, patroni-set-0004-0 started restoring from the WAL archive; I left it overnight to restore. During this time both patroni-set-0003-0 and patroni-set-0004-0 were running, but patroni-set-0003-0 was out of disk.

Several hours later, patroni-set-0004-0 was logging lots of:

following a different leader because i am not the healthiest node
Lock owner: None; I am patroni-set-0004-0
wal_e.blobstore.gs.utils WARNING MSG: could no longer locate object while performing wal restore
DETAIL: The absolute URI that could not be located is gs://the-bucket/spilo/the-scope/wal/wal_005/the-file.lzo.
HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
STRUCTURED: time=2017-06-26T12:05:23.646236-00 pid=207
lzop: <stdin>: not a lzop file
[...] 
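
For anyone debugging similar messages, one thing worth checking is whether the referenced segments actually exist in the archive (a sketch, assuming gsutil access to the same bucket and prefix as in the logs above):

    # list what is actually present in the WAL archive for this scope
    gsutil ls gs://the-bucket/spilo/the-scope/wal/wal_005/ | head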

I expected patroni-set-0004-0 to take over the master lock by this time.


Debugging why the disk outage occurred, I found out about ext filesystem reserved blocks. I then recovered 25Gi of disk space on patroni-set-0003-0's pgdata by running tune2fs -m 0 /dev/$PGDATA_DEV. I realize in hindsight that simply resizing the GCE PD would have been easier.
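
For anyone hitting the same wall, the reserved-block trick looks roughly like this (a sketch; the mount path and the device-name lookup are assumptions and will differ depending on where you run it, e.g. inside the pod vs. on the GKE node):

    # resolve the block device backing the pgdata mount (e.g. sdb)
    PGDATA_DEV=$(basename "$(df --output=source /home/postgres/pgdata | tail -n 1)")
    # drop the ~5% of blocks ext2/3/4 reserves for root (~25Gi on a 500Gi volume)
    tune2fs -m 0 /dev/$PGDATA_DEV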

However, once patroni-set-0003-0 was given the extra space and restarted, it did not seem willing to take the leader role, even with free disk and no current leader, logging lots of:

Lock owner: None; I am patroni-set-0003-0
wal_e.blobstore.gs.utils WARNING MSG: could no longer locate object while performing wal restore
DETAIL: The absolute URI that could not be located is gs://the-bucket/spilo/the-scope/wal/wal_005/the-file.lzo.
HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
STRUCTURED: time=2017-06-26T12:05:23.646236-00 pid=207
lzop: <stdin>: not a lzop file
[...] 

I expected patroni-set-0003-0 to take the leader role by this time.


I then did the same thing to patroni-set-0003-{1,2}, freeing up 25Gi of space.

Once patroni-set-0003-1 was given extra disk space and restarted, it took the master lock.
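
To double-check which member holds the lock, querying Patroni's REST API from inside a pod is enough (a sketch; assumes curl is available in the spilo image and Patroni listens on its default port 8008):

    # returns this member's view of its own state and role
    kubectl exec patroni-set-0003-1 -- curl -s http://localhost:8008/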

joar · Jun 27 '17 12:06