Enabling concurrent `ghe-backup` tasks from separate backup hosts

Open taz opened this issue 5 years ago • 3 comments

Consider a scenario where two backup hosts exist:

  • One for regular snapshots, situated in the same datacenter as the appliance.
  • One for taking less frequent off-site backups (where copying the snapshots from the host above is not viable).

There's currently no check in ghe-backup that prevents simultaneous runs from different hosts. However, it looks like there's at least one place where a semaphore file placed on the appliance (to suspend repository maintenance) may be removed prematurely if the backup tasks overlap during specific phases: https://github.com/github/backup-utils/blob/master/share/github-backup-utils/ghe-gc-enable#L35. For example, if:

  1. Backup A starts the repository backup phase and creates a semaphore file.
  2. The background maintenance queue on GitHub Enterprise is suspended.
  3. Backup B commences its backup and overwrites the same semaphore file.
  4. Backup A completes its repository backup phase and removes the semaphore file.
  5. The background maintenance queue on GitHub Enterprise starts draining.
  6. Repository data is potentially modified by a maintenance task.
  7. Backup B completes its repository backup phase and finds no semaphore file to remove.
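
To make the sequence above concrete, here is a small local simulation. It is only a sketch: a temp file stands in for the semaphore on the appliance, and the file name and the `gc_disable` / `gc_enable` / `maintenance_suspended` helpers are hypothetical stand-ins, not the actual backup-utils scripts.

```shell
#!/usr/bin/env bash
# Local simulation only: nothing here talks to a real appliance.
set -u

sync_dir=$(mktemp -d)
semaphore="$sync_dir/gc-disabled"   # hypothetical file name

gc_disable() { echo "$1" > "$semaphore"; }     # create or overwrite the semaphore
gc_enable()  { rm -f "$semaphore"; }            # unconditional removal
maintenance_suspended() { [ -e "$semaphore" ]; }

gc_disable "backup-A"   # step 1: A creates the semaphore
gc_disable "backup-B"   # step 3: B overwrites the same file
gc_enable               # step 4: A finishes and removes it

# Backup B is still copying repositories at this point.
if maintenance_suspended; then
  result="suspended"
else
  result="running"
fi
echo "maintenance while backup B is still copying: $result"

rm -rf "$sync_dir"
```

With a single shared file, the last `echo` reports `running`, which is the window described in steps 5 and 6.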

Is this the only place where such a condition exists? If so, would it be possible to keep a count of the active backup tasks instead and remove the file only when it reaches 0? Or are there other considerations / complexities to take into account which makes the approach unpredictable or unreliable?
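
A rough sketch of the counting idea, again simulated locally. The counter file, the `mkdir`-based lock and all helper names are assumptions made for illustration; on a real appliance the read-modify-write would need whatever locking is available there.

```shell
#!/usr/bin/env bash
# Local simulation only: a temp directory stands in for state on the appliance.
set -u

state_dir=$(mktemp -d)
counter="$state_dir/active-backups"   # hypothetical file name
lock="$state_dir/lock"

# mkdir is atomic, so it serves as a simple mutex around counter updates.
with_lock() {
  local rc=0
  until mkdir "$lock" 2>/dev/null; do sleep 0.1; done
  "$@" || rc=$?
  rmdir "$lock"
  return "$rc"
}

read_count() { cat "$counter" 2>/dev/null || echo 0; }

increment() {
  local n
  n=$(( $(read_count) + 1 ))
  echo "$n" > "$counter"
}

decrement() {
  local n
  n=$(( $(read_count) - 1 ))
  # The file disappears only when the last active backup finishes.
  if [ "$n" -le 0 ]; then rm -f "$counter"; else echo "$n" > "$counter"; fi
}

maintenance_suspended() { [ -e "$counter" ]; }
state() { if maintenance_suspended; then echo suspended; else echo running; fi; }

with_lock increment      # backup A starts
with_lock increment      # backup B starts
with_lock decrement      # backup A finishes
after_a=$(state)         # B is still active
with_lock decrement      # backup B finishes
after_b=$(state)

echo "after A finishes: $after_a, after B finishes: $after_b"
rm -rf "$state_dir"
```

Here maintenance stays suspended after A finishes and resumes only once B is done as well.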

taz, Sep 25 '18 02:09

> Is this the only place where such a condition exists? If so, would it be possible to keep a count of the active backup tasks instead and remove the file only when it reaches 0?

This is an interesting idea, and off the top of my head, I think this is the only location where we're putting in place any sort of locking/semaphore on the appliance side of things. If I'm remembering correctly, keeping a count is probably a good solution.

lildude, Sep 27 '18 17:09

> Or are there other considerations / complexities to take into account which makes the approach unpredictable or unreliable?

A counter may not be sufficient. For example, if a backup was interrupted and the counter wasn't decremented, the appliance would continue to believe that backup was still running.

Adding the hostname (or some other identifying details) may help to detect collisions and resolve them more logically. For example, in the above case, if the interrupted backup was rerun, it could "add or replace" its hostname in the list, then know to "remove" its hostname when complete.

Extending this to hostname + backup PID would also account for accidental overlap of multiple backup runs from the same host.

As a side effect, if the semaphore file was left behind, it would identify the backup host that did so.
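
One way this could look, sketched locally: one marker file per hostname + PID in a registry directory, with maintenance suspended while the directory is non-empty. The directory, the `host_pid` naming scheme, the host names and the helpers (`register`, `unregister`, `clear_host`) are all hypothetical, and deciding when a leftover entry counts as stale is a policy question this sketch does not answer.

```shell
#!/usr/bin/env bash
# Local simulation only: a temp directory stands in for a registry on the appliance.
set -u

registry=$(mktemp -d)

register()   { : > "$registry/$1_$2"; }     # args: hostname, backup PID
unregister() { rm -f "$registry/$1_$2"; }
clear_host() { rm -f "$registry/$1"_*; }    # drop every entry for one host
active()     { ls -A "$registry"; }
maintenance_suspended() { [ -n "$(active)" ]; }

register backup-onsite 4100     # on-site backup starts
register backup-offsite 512     # off-site backup starts
# The on-site run is interrupted here and never unregisters.
unregister backup-offsite 512   # off-site backup finishes normally

leftover=$(active)
echo "entries left behind: $leftover"

# When the interrupted host reruns, it replaces its old entry
# and removes the new one once it completes.
clear_host backup-onsite
register backup-onsite 4217
unregister backup-onsite 4217

if maintenance_suspended; then final="suspended"; else final="running"; fi
echo "after the rerun completes: $final"
rm -rf "$registry"
```

The leftover entry names the host and PID that did not clean up, which is the side effect mentioned above. Note that `clear_host` would also remove the entry of a second run still active on the same host, so a real implementation would need to choose between "replace by host" and "track every PID".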

kathodos, May 24 '19 07:05