Citus automated backup

Open steven-sheehy opened this issue 2 years ago • 4 comments

Problem

We need to ensure our Citus installation has automated backups.

Solution

  • Investigate StackGres backup functionality
  • Automate backups in a multi-node setup
  • Document the manual restore process in database.md
  • Update the existing citus.md and database.md so that they document the backup and restore process for the correct approach

Alternatives

steven-sheehy avatar Jan 30 '23 15:01 steven-sheehy

The manual backup approach has been documented in citus.md.

jnels124 avatar May 10 '24 16:05 jnels124

The restore side of this is currently blocked by an upstream issue.

jnels124 avatar May 22 '24 20:05 jnels124

Issues / findings from testing StackGres

  • pg_basebackup isn't feasible for a large database, since every base backup takes too long to create and consumes too much storage
  • Using a VolumeSnapshot as the base backup took much longer than expected: 20+ minutes for a database with less than 1GB of data
  • Creating a VolumeSnapshot sometimes fails on consecutive attempts, with a different error each time
    • first: Failed to create snapshot content with error snapshot controller failed to update mirror-citus-coord-data-mirror-citus-coord-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims \"mirror-citus-coord-data-mirror-citus-coord-0\": the object has been modified; please apply your changes to the latest version and try again
    • subsequently: Error from server (AlreadyExists): error when creating "STDIN": volumesnapshots.snapshot.storage.k8s.io "manual-test-coord" already exists
  • Continuous archiving, which enables PITR, can't be turned off; that's a lot of WAL segments to back up under high TPS
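
The second error above (AlreadyExists) suggests that a retry reused the name of the VolumeSnapshot object left behind by the failed first attempt. One possible workaround, sketched below in Python, is to give every attempt a unique, timestamp-suffixed name. This is only an illustration and not the procedure used in the testing: the function `build_volume_snapshot` and the snapshot class name `example-snapclass` are hypothetical, while the PVC name and the base name `manual-test-coord` are taken from the error messages. It does not address the first error.

```python
import json
from datetime import datetime, timezone


def build_volume_snapshot(pvc_name, base_name, snapshot_class, now=None):
    """Return a VolumeSnapshot manifest whose name carries a UTC timestamp.

    A fresh name per attempt avoids colliding with a snapshot object that an
    earlier, failed attempt left behind.
    """
    now = now or datetime.now(timezone.utc)
    suffix = now.strftime("%Y%m%d%H%M%S")
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": f"{base_name}-{suffix}"},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


if __name__ == "__main__":
    manifest = build_volume_snapshot(
        pvc_name="mirror-citus-coord-data-mirror-citus-coord-0",
        base_name="manual-test-coord",
        snapshot_class="example-snapclass",  # hypothetical class name
    )
    # kubectl accepts JSON on stdin, e.g.: python snapshot.py | kubectl create -f -
    print(json.dumps(manifest, indent=2))
```

Failed snapshot objects from earlier attempts would still need to be cleaned up separately.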

xin-hedera avatar May 23 '24 18:05 xin-hedera

The restore side of this is currently blocked by upstream issue.

@jnels124 Can you share the steps to reproduce this issue, along with the exact properties of the volumes used? We don't seem to hit this in the e2e testing for openebs/zfs-localpv.

Abhinandan-Purkait avatar May 31 '24 13:05 Abhinandan-Purkait