
k8s.io disaster recovery plan

Open amy opened this issue 6 years ago • 38 comments

Broad issue to track what our disaster recovery plan is if k8s.io registry somehow gets deleted.

One suggestion was creating a backup registry that snapshots k8s.io registry.

amy avatar Jul 10 '19 16:07 amy

There are 2 things: 1.) Does the lack of a disaster recovery plan (aka maintaining the status quo today) prohibit the image promoter from being released to the general public? 2.) Initial brainstorming on possible disaster recovery options.

amy avatar Jul 10 '19 20:07 amy

cc/ @thockin @listx @spiffxp

amy avatar Jul 10 '19 20:07 amy

Does the lack of a disaster recovery plan (aka maintaining the status quo today) prohibit the image promoter from being released to the general public?

IMO it does. I don't want to be over-pedantic but if we don't force ourselves to do it, it won't get done. :(

Brainstorming: This doesn't have to be the most amazing, elegant, automatic thing in the world. It might simply be:

  • A daily job (running where? how do we know if it fails or stops running?) that copies all images by SHA to another GCS bucket which has a much smaller set of things that can access it and a strong retention policy. Also snapshots of the promoter YAMLs (the SHA-to-tag mappings).

  • A program which consumes the snapshot yaml files and promotes the backup images into a GCR, restoring tags to SHAs.

  • A monthly job that runs the restore into a test GCR, generates a log, and then erases it all.

If that is too onerous, what corners can we cut?
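A rough sketch of that daily copy job, written as a function (the bucket names and manifest path below are hypothetical placeholders, not real infrastructure):

```shell
#!/usr/bin/env bash
# Sketch of the daily backup job described above. The bucket names and
# promoter-manifest path are hypothetical placeholders.
set -euo pipefail

backup_once() {
  local id
  id=$(date --rfc-3339=date)   # e.g. 2019-07-10

  # Copy the registry's underlying objects (images by SHA) to a backup bucket
  # that has a strong retention policy and a much smaller set of accessors.
  gsutil rsync -r gs://artifacts.k8s-artifacts-prod.appspot.com/ \
    "gs://k8s-artifacts-backup/${id}/"

  # Also snapshot the promoter YAMLs (the SHA-to-tag mappings).
  gsutil cp /path/to/promoter-manifests/*.yaml \
    "gs://k8s-artifacts-backup/${id}/manifests/"
}
```

Whether this runs in Prow or a plain cron, the "how do we know if it fails or stops running?" question above still needs an answer.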

thockin avatar Jul 10 '19 21:07 thockin

Isn't GCR just a GCS bucket fronted by a proxy/API? Could the backend bucket just be backed up/copied? Would the GCP storage transfer service be enough? Or a cron'd gsutil sync?

cblecker avatar Jul 10 '19 22:07 cblecker

Not as simple as a dumb bucket-wise copy, but: GCR just stores digests of images. As long as there is a reference to it, it won't get deleted. So, we could copy everything into another GCR, but namespace it under a timestamped folder. E.g. gcr.io/backup/20190701/..., and it won't eat up a ton of storage because Docker already de-dupes things.

And we could also try turning on that lifecycle thing for the underlying GCS (bucket) layer: https://cloud.google.com/storage/docs/lifecycle.
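For reference, a GCS lifecycle configuration is just a small JSON document. The rule below (move 30-day-old objects to Coldline) is only an illustration of the shape, not a recommendation:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 30}
    }
  ]
}
```

It would be applied with something like `gsutil lifecycle set lifecycle.json gs://artifacts.<project>.appspot.com`.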

listx avatar Jul 10 '19 22:07 listx

+1 on listx's suggestion. Let's narrow down the solutions so that we can get started and unblock releasing the promoter to the rest of the community.

amy avatar Jul 15 '19 17:07 amy

Is backing up the GCS bucket good enough? Or do we need to do it at the GCR level (copy image by image, tags, and so on)?

javier-b-perez avatar Jul 15 '19 17:07 javier-b-perez

@listx What happens if the bucket or project gets deleted?

cblecker avatar Jul 15 '19 18:07 cblecker

@cblecker Can you clarify?

listx avatar Jul 15 '19 21:07 listx

/assign @amy

Please continue the discussion. I'm following the thread & will write up a google doc of some options for next week's meeting.

amy avatar Jul 15 '19 21:07 amy

@listx I guess I'm not clear on your proposed multiple-GCRs-with-digests proposal. Isn't GCR just a frontend over an underlying GCS bucket, scoped to a project?

cblecker avatar Jul 16 '19 00:07 cblecker

@listx I guess I'm not clear on your proposed multiple-GCRs-with-digests proposal. Isn't GCR just a frontend over an underlying GCS bucket, scoped to a project?

Yes. But AFAIK GCS alone does not auto-dedup data. A quick google search led me to https://cloud.google.com/solutions/partners/storreduce-cloud-deduplication which supports my assumption.

Ultimately we would be taking daily(?) snapshots of all the images in k8s.gcr.io. If deduplication is free (via another Docker Registry such as GCR), then we can even take hourly snapshots and it won't matter much.

listx avatar Jul 16 '19 17:07 listx

While we haven't yet reached consensus on the initial backup approach, let's try to identify some invariants.

  1. Job duration < 24 hrs: I think we want the backups to happen at least daily.
  2. Disk usage: because of (1), we really want to de-dupe data. This rules out GCS bucket-wise copies (although, one could argue, we could have a rolling window of backed-up snapshots --- e.g. only the last 30 days).
  3. Restoration: following the spirit of Tim's 2nd bullet point, there needs to be some process that understands how to restore from the backup to an "original" state. Using the prefixed-by-date GCR backup idea, this would be as simple as copying all images from (for example) gcr.io/some-backup-project-name/20190808/... -> {asia,eu,us}.gcr.io/k8s-artifacts-prod/... . There are many options here; it could involve some combination of the promoter's -snapshot flag along with gcrane (gcrane, unlike gcloud, which the promoter currently relies on, can copy images that don't even have a tag).
  4. A job that actually runs the restoration: this follows Tim's 3rd bullet point.

Do these points sound reasonable as a first stab at this problem? I think using the promoter's -snapshot flag to generate an easy-to-read YAML inventory of all images in the GCR-to-backup makes sense. These snapshot YAMLs would be stored in GCS (or if we're fancy, in GitHub). I think the backup "job" should run in Prow (and surely, Prow has some Slack alert thing that we can enable for the backup job).
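To make that concrete, a snapshot YAML would presumably look something like the images: section of a regular promoter manifest (the digests and tags below are invented for illustration):

```yaml
# Illustrative shape of a -snapshot inventory; digests are truncated fakes.
images:
- name: pause
  dmap:
    "sha256:aaaa0000...": ["3.1"]
- name: kube-apiserver
  dmap:
    "sha256:bbbb1111...": ["v1.16.0"]
```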

As for where this backup job logic should live --- I'm guessing github.com/kubernetes/k8s.io, or some other k8s repo (and not this promoter repo).

listx avatar Aug 16 '19 00:08 listx

Looks like there is already a GCS disaster recovery script underway here: https://github.com/kubernetes/k8s.io/pull/334. We should probably follow the same infrastructural patterns established there.

listx avatar Aug 16 '19 01:08 listx

The pattern that I'm proposing in #334 is a different script for copying everything, with a no-overwrite / no-delete policy (I implemented that in code, @thockin pointed out that we can probably just use retention policy). (Edit: different as in not reusing the same code that we use for promotion)

However, for registries which naturally de-dup I agree with the suggestions of using a date suffix.

And nice find on gcrane @listx ! How about:

gcrane cp -r gcr.io/k8s-staging-cluster-api-aws gcr.io/backup-dest/k8s-staging-cluster-api-aws/$(date --rfc-3339=date)

Of course, it'll take some time to translate that from bash to a programming language ;-) And while this solution does dedup, it doesn't protect against accidental/malicious tag deletion if someone gets access.

If we do want to protect against that, another option is to rsync the bucket underlying GCR, and then also export the manifests and upload them. This is relatively cheap, and we also can then have a GCS bucket with a retention policy to prevent overwriting.

ID=$(date --rfc-3339=date)
gsutil rsync -r gs://artifacts.k8s-staging-cluster-api-aws.appspot.com/containers/images/ gs://backup/containers/images/
gcrane ls -r gcr.io/k8s-staging-cluster-api-aws | grep -v @sha256 | xargs -I {} bash -c "mkdir -p tags/${ID}/\$(dirname {}) && gcrane manifest {} > tags/${ID}/{}.manifest"
gsutil rsync -r tags/ gs://backup/tags

(Still rough: creating each image's parent directory inside the xargs command is what handles nested images, and image names are assumed to be shell-safe.)

The downside is that it isn't trivial to restore from that, and that we're making some assumptions about the structure of GCR. But we could easily bring up a server that serves from this structure - whether that's a temporary one for DR, or because we want some mirrors that don't use GCR. If we're really sneaky, it's even possible to serve directly from GCS, I believe.

justinsb avatar Aug 20 '19 00:08 justinsb

The pattern that I'm proposing in #334 is a different script for copying everything, with a no-overwrite / no-delete policy (I implemented that in code, @thockin pointed out that we can probably just use retention policy). (Edit: different as in not reusing the same code that we use for promotion)

However, for registries which naturally de-dup I agree with the suggestions of using a date suffix.

And nice find on gcrane @listx ! How about:

gcrane cp -r gcr.io/k8s-staging-cluster-api-aws gcr.io/backup-dest/k8s-staging-cluster-api-aws/$(date --rfc-3339=date)

Of course, it'll take some time to translate that from bash to a programming language ;-) And while this solution does dedup, it doesn't protect against accidental/malicious tag deletion if someone gets access.

I think it makes sense to just start out with something simple like this. One thing to note here is that the backup GCR will have its own separate service account for write access to the backups. It doesn't buy us a ton of security but it's better than the status quo.

Are there any volunteers for this initial implementation using gcrane to do the copy? It would have to live in a Prow job. Please comment!

EDIT: I'd like to clarify that I will take an initial stab at the implementation (you should see a PR this week); I just wanted to see if other people on this thread wanted to chip in. :)

listx avatar Aug 26 '19 20:08 listx

An additional thought: I think it makes sense for the backup GCR to additionally mirror the latest snapshot of the prod GCR. This way, we could just redirect the vanity domain k8s.gcr.io to point to the backup GCR in case the prod GCR gets hosed, so that we don't have to wait for the backfill process to finish (there would be very minimal downtime).

The one slightly ugly part is that now the backup GCR looks like this:

gcr.io/<backup-project-name>/foo-img
gcr.io/<backup-project-name>/bar-img
gcr.io/<backup-project-name>/...
gcr.io/<backup-project-name>/backups/<DATE>/...

where the backups folder would take up a name that the new prod GCR must not have (it's a sort of reserved name). But I think this is minor/negligible.

I suppose the missing piece here is that the backup GCR has to be made smart enough to only mirror good states (i.e., if an attacker re-tags all images, we don't want the backup mirror to do the same --- there would have to be some sort of delta heuristic so the backup process can detect bad states of the original and know when not to mirror them).
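One crude form of that heuristic, assuming snapshots are stored as plain text files with one tag-to-digest entry per line (both the file layout and the threshold are invented here):

```shell
#!/usr/bin/env bash
# Refuse to mirror when too many snapshot entries changed since the last
# backup, on the theory that a mass re-tag looks like an attack.
# The one-entry-per-line snapshot format and the threshold are assumptions.
set -euo pipefail

check_delta() {
  local prev=$1 curr=$2 max_changed=$3
  local changed
  # Count lines that differ between the previous and current snapshots.
  changed=$(diff "$prev" "$curr" | grep -c '^[<>]' || true)
  if (( changed > max_changed )); then
    echo "refusing to mirror: ${changed} entries changed (max ${max_changed})"
    return 1
  fi
  echo "ok to mirror: ${changed} entries changed"
}
```

The mirror job would call check_delta before copying, and page a human instead of mirroring when it fails.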

listx avatar Aug 28 '19 19:08 listx

Are there any thoughts about using the promoter directly for performing backups? We should be able to do this once #118 is merged.

The backup process would be:

  1. Construct a "backup" Promoter manifest. We can use the -snapshot flag to record all reachable images in a repo. (This output is 99% of a regular Promoter manifest, minus the registries: field.)
  2. Promote all images in the backup manifest with a rebased name, prefixed by date:

     registries:
     - name: gcr.io/k8s-cip-test-prod
       service-account: [email protected]
       src: true
     # Same for all the regions for multi-regional backups.
     - name: us.gcr.io/k8s-cip-test-prod/<DATE>
       service-account: [email protected]

  3. Save the backup manifest to a GCS bucket (or GitHub or somewhere else). Saving it in GitHub would be nice because of the easier discoverability and change history.
  4. Repeat the above steps daily.

I think steps 1 and 2 can be glued together with either a shell script or Go binary (we already have the framework for this sort of "glue" code in our e2e tests, so we can reuse the code there if we decide to use Go instead of bash).

I think this is 1/2 of Disaster Recovery. The other 1/2 would be the Restoration process that restores backed-up images to a test GCR. This is actually pretty similar to the other half:

  1. Promote all images in the backup (the backup will already have a list of snapshot YAMLs by date) to the target GCR (in this case, the test GCR).
  2. Take a snapshot of the test GCR and ensure that it matches with the snapshot we used for the promotion (this is the same approach we use in our e2e tests).
  3. Delete the test GCR.

I think the only missing piece is some easy way of making the promoter promote directly from a snapshot YAML, by allowing the user to supply the missing registries: field dynamically as CLI arguments, ENV vars, or some other mechanism.
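The restoration half above might be glued together roughly like this (the registry names, the cip invocation, and the cleanup loop are all guesses, not a vetted design):

```shell
#!/usr/bin/env bash
# Hypothetical restore drill: promote a dated backup into a scratch GCR,
# verify the snapshot round-trips, then clean up. All names and flags
# here are illustrative assumptions.
set -euo pipefail

restore_drill() {
  local date=$1                                 # e.g. 2019-08-08
  local backup="gcr.io/some-backup-project-name/${date}"
  local target=gcr.io/k8s-restore-test          # hypothetical scratch project

  # 1. Promote everything from the dated backup into the scratch registry.
  gcrane cp -r "${backup}" "${target}"

  # 2. Snapshot the scratch registry and compare it with the stored snapshot.
  cip -snapshot "${target}" > restored.yaml
  diff "snapshots/${date}.yaml" restored.yaml

  # 3. Tear down the scratch registry's contents.
  gcrane ls -r "${target}" | xargs -n1 gcrane delete
}
```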

listx avatar Aug 30 '19 22:08 listx

gcrane cp -r should work for this

jonjohnsonjr avatar Sep 19 '19 01:09 jonjohnsonjr

I am working on a doc to sum everything up + an initial implementation. Will share with this thread soon... stay tuned!

listx avatar Sep 20 '19 21:09 listx

Here is a writeup of an initial approach/design: https://docs.google.com/document/d/1od5y-Z2xP9mVmg2Yztnv-GQ7D-orj9HsTmeVvNHkzzA/edit?usp=sharing

Mailing list link: https://groups.google.com/d/msg/kubernetes-wg-k8s-infra/cseCwgALwdk/iOYkaEYFCAAJ

You must be a member of the kubernetes-wg-k8s-infra Google group in order to access the document.

listx avatar Sep 23 '19 20:09 listx

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 22 '19 21:12 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Jan 21 '20 22:01 fejta-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

fejta-bot avatar Feb 20 '20 23:02 fejta-bot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 20 '20 23:02 k8s-ci-robot

Hi @listx what is the status of this issue? Do you think we should have it open?

bartsmykla avatar Mar 06 '20 06:03 bartsmykla

/remove-lifecycle rotten /reopen

Reopening to give @listx a chance to triage.

cblecker avatar Mar 07 '20 04:03 cblecker

@cblecker: Reopened this issue.

In response to this:

/remove-lifecycle rotten /reopen

Reopening to give @listx a chance to triage.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 07 '20 04:03 k8s-ci-robot

Yup, this should stay open. There need to be more docs around how to handle the scenario of "omg, the prod GCR is messed up, how do I recover?".

/assign @listx

listx avatar Mar 09 '20 18:03 listx

/unassign @amy

listx avatar Mar 09 '20 18:03 listx