Create runbook for loss of a platform component

Open davidkelliott opened this issue 2 years ago • 2 comments

User Story

As an MP engineer, I need to know what to do if a component of the platform (e.g. a networking component or an AWS account) disappears, so that I can recreate it.

User Type(s)

Value

Questions / Assumptions / Hypothesis

Definition of done

  • [ ] readme has been updated
  • [ ] user docs have been updated
  • [ ] another team member has reviewed
  • [ ] tests are green
  • [ ] UR test OR added to continual research plan

Reference

How to write good user stories

davidkelliott avatar Jun 16 '22 13:06 davidkelliott

This issue is stale because it has been open 90 days with no activity.

github-actions[bot] avatar Dec 16 '22 01:12 github-actions[bot]

This issue is stale because it has been open 90 days with no activity.

github-actions[bot] avatar Sep 07 '23 01:09 github-actions[bot]

https://user-guide.modernisation-platform.service.justice.gov.uk/runbooks/dr-process.html#priority-list - Priority List

https://user-guide.modernisation-platform.service.justice.gov.uk/runbooks/dr-process.html#single-account - Single account

ep-93 avatar Mar 28 '24 09:03 ep-93

I think @ep-93 has covered a lot of this, but I see the following as the platform components that would need to be rebuilt in the event of a region loss.

  • Modernisation Platform AWS account
    • Contains components necessary to run the platform
    • Some of these are either global, or replicated to different regions
    • KMS keys, S3 buckets, DynamoDB, used for Terraform state
  • core-logging-production AWS account
    • S3 buckets used for storing logs & attendant KMS keys already appear to be replicated into eu-west-1
  • core-network-services-production AWS account
    • Transit Gateway would need to be created in new region
    • external-inspection VPC & Network Firewall for communication with internal MOJ networks
    • non_live_data / live_data VPCs & Network Firewalls for communication between platform & internet
  • core-shared-services-production AWS account
    • non_live_data / live_data VPCs used for EC2 Image Builder pipelines, and for some shared customer infrastructure (Active Directory controllers)
    • AMIs created through pipelines & used by infrastructure will be region-bound
    • Instance Scheduler lambda (not critical, but has a region-specific image URI)
    • shared KMS keys offered to customers
  • core-vpc-$environment AWS accounts
    • $environment-$business_unit VPCs (required to share resources out to member accounts)
    • AWS Backups provided through baselines; are these meant to allow customers to recreate instances? If we duplicate backups to a separate region, can we be sure they'll work for customers?
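On the point above that AMIs are region-bound: one option is to copy them into a second region ahead of time, since a copy cannot be taken once the source region is unavailable. The following is a minimal sketch only, not the platform's actual configuration. It assumes eu-west-1 as the recovery region, and the provider alias, variable name and AMI name are all hypothetical.

```hcl
# Assumption: eu-west-1 is the recovery region; the thread does not settle on one.
provider "aws" {
  alias  = "recovery"
  region = "eu-west-1"
}

# Hypothetical input: the ID of an AMI built by a pipeline in eu-west-2.
variable "source_ami_id" {
  type        = string
  description = "ID of the AMI in eu-west-2 to copy into the recovery region"
}

# Copies the AMI (and its snapshots) into the recovery region.
resource "aws_ami_copy" "recovery" {
  provider          = aws.recovery
  name              = "example-recovery-copy" # hypothetical name
  source_ami_id     = var.source_ami_id
  source_ami_region = "eu-west-2"
  encrypted         = true
}
```

The copy gets a new AMI ID in the destination region, so anything that references AMIs by ID would need a per-region lookup.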

dms1981 avatar Apr 10 '24 13:04 dms1981

These two aren't platform components specifically, but would also need consideration:

  • MOJ Master account
    • AWS Organizations Service Control Policies restricting use of AWS outside of eu-west-2
  • MOJ Official (Production) account
    • Contains Transit Gateways. We'd need to peer a new TGW with an existing TGW in this account for internal connectivity with the rest of the MOJ
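On the Service Control Policy point above: the actual MOJ policy is not shown in this issue, so the following is only a generic sketch of the common region-restriction pattern, with a second region added to the allow list. The policy name, the choice of eu-west-1 and the list of exempted global services are assumptions for illustration.

```hcl
# Generic region-restriction SCP; NOT the real MOJ policy.
data "aws_iam_policy_document" "region_restriction" {
  statement {
    sid    = "DenyOutsideApprovedRegions"
    effect = "Deny"

    # Illustrative exemptions for services that are global rather than regional.
    not_actions = [
      "iam:*",
      "organizations:*",
      "route53:*",
      "cloudfront:*",
      "support:*",
      "sts:*",
    ]
    resources = ["*"]

    condition {
      test     = "StringNotEquals"
      variable = "aws:RequestedRegion"
      # Assumption: eu-west-1 is the region added for recovery.
      values = ["eu-west-2", "eu-west-1"]
    }
  }
}

resource "aws_organizations_policy" "region_restriction" {
  name    = "example-region-restriction" # hypothetical name
  type    = "SERVICE_CONTROL_POLICY"
  content = data.aws_iam_policy_document.region_restriction.json
}
```

A runbook would need to record who can change the policy in the MOJ Master account, because nothing can be created in a new region until it is allowed there.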

dms1981 avatar Apr 10 '24 13:04 dms1981

With regard to our KMS keys, the answer here might be to look more deeply into provisioning aws_kms_replica_key resources, as also discussed here.
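As a rough illustration of the replica key approach, the sketch below creates a multi-Region primary key in eu-west-2 and a replica in a second region. The provider aliases, descriptions and the choice of eu-west-1 are assumptions, not the platform's actual configuration.

```hcl
provider "aws" {
  alias  = "primary"
  region = "eu-west-2"
}

# Assumption: eu-west-1 is the replica region.
provider "aws" {
  alias  = "replica"
  region = "eu-west-1"
}

# A replica can only be created from a key that has multi_region = true.
resource "aws_kms_key" "primary" {
  provider                = aws.primary
  description             = "Example multi-Region primary key"
  multi_region            = true
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

# The replica shares key material with the primary but has its own key policy.
resource "aws_kms_replica_key" "replica" {
  provider                = aws.replica
  description             = "Example replica of the primary key"
  primary_key_arn         = aws_kms_key.primary.arn
  deletion_window_in_days = 30
}
```

An existing single-Region key cannot be converted to a multi-Region key, so adopting this for keys already in use would mean creating new keys and re-encrypting data.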

dms1981 avatar Apr 10 '24 15:04 dms1981

With regard to AWS Backup, we can also copy backups into a separate region:

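resource "aws_backup_plan" "replica" {
  ...
  rule {
    # Copy each recovery point to a vault in another region.
    copy_action {
      # Placeholder: the real value must be the full ARN of the destination vault.
      destination_vault_arn = "arn:aws:backup:*:*:backup-vault:replica"
      lifecycle {}
    }
  }
}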
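A fuller version of the snippet above might look like the following sketch, which also creates the destination vault so that its ARN can be referenced directly. The vault and plan names, schedule, retention periods and the choice of eu-west-1 are assumptions, and the default provider is assumed to be configured for eu-west-2.

```hcl
# Assumption: eu-west-1 is the region that receives the copies.
provider "aws" {
  alias  = "replica"
  region = "eu-west-1"
}

# Source vault in the default (assumed eu-west-2) region; hypothetical name.
resource "aws_backup_vault" "source" {
  name = "example-source"
}

# Destination vault in the replica region.
resource "aws_backup_vault" "replica" {
  provider = aws.replica
  name     = "replica"
}

resource "aws_backup_plan" "with_copy" {
  name = "example-plan-with-cross-region-copy" # hypothetical name

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.source.name
    schedule          = "cron(0 1 * * ? *)" # illustrative schedule

    lifecycle {
      delete_after = 35 # illustrative retention, in days
    }

    # Copy each recovery point into the replica vault.
    copy_action {
      destination_vault_arn = aws_backup_vault.replica.arn

      lifecycle {
        delete_after = 35
      }
    }
  }
}
```

This only covers getting the recovery points into a second region. Whether customers can restore from them there is still the open question raised earlier in the thread.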

dms1981 avatar Apr 11 '24 14:04 dms1981

So I think this gives us the following list of runbooks to create for platform components:

  • [ ] Modernisation Platform account
    • [ ] AWS resources (IAM roles, accounts, secrets)
    • [ ] Resources used by Terraform (S3, DynamoDB)
  • [ ] core-logging-production account
    • [ ] S3 bucket for logs
  • [ ] core-network-services account
    • [ ] VPCs w/ Network Firewalls (and NAT gateways for egress)
    • [ ] Transit Gateway (and peering to MOJ TGW)
  • [ ] core-shared-services account
    • [ ] VPCs for shared infrastructure
    • [ ] AMI builder & resources (eg, S3 bucket)
    • [ ] Instance Scheduler
    • [ ] Shared KMS keys
  • [ ] core-vpc-$environment accounts
    • [ ] VPCs
    • [ ] RAM shares, VPC endpoints
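For the Transit Gateway item in the list above, a runbook could start from something like the sketch below, which creates a new Transit Gateway in a recovery region and requests a peering attachment to an existing MOJ Transit Gateway. All variable names, the provider alias and the choice of eu-west-1 are hypothetical, and the real IDs are deliberately left as inputs.

```hcl
# Assumption: eu-west-1 is the recovery region.
provider "aws" {
  alias  = "recovery"
  region = "eu-west-1"
}

# Hypothetical inputs describing the existing MOJ Transit Gateway.
variable "moj_account_id" {
  type        = string
  description = "ID of the account that owns the existing MOJ Transit Gateway"
}

variable "moj_tgw_id" {
  type        = string
  description = "ID of the existing MOJ Transit Gateway to peer with"
}

variable "moj_tgw_region" {
  type        = string
  description = "Region of the existing MOJ Transit Gateway"
}

resource "aws_ec2_transit_gateway" "recovery" {
  provider    = aws.recovery
  description = "Example Transit Gateway in the recovery region"
}

# Requests peering; the owner of the peer Transit Gateway must accept it.
resource "aws_ec2_transit_gateway_peering_attachment" "moj" {
  provider                = aws.recovery
  transit_gateway_id      = aws_ec2_transit_gateway.recovery.id
  peer_account_id         = var.moj_account_id
  peer_region             = var.moj_tgw_region
  peer_transit_gateway_id = var.moj_tgw_id
}
```

Peering attachments need static routes on both sides, because routes are not propagated across a peering, so the runbook would also need to cover route table entries and acceptance in the MOJ account.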

dms1981 avatar Apr 12 '24 10:04 dms1981