modernisation-platform
Create runbook for loss of a platform component
User Story
As an MP engineer, I need to know what to do if a component of the platform (e.g. a networking component or an AWS account) disappears, so that I can recreate it.
User Type(s)
Value
Questions / Assumptions / Hypothesis
Definition of done
- [ ] readme has been updated
- [ ] user docs have been updated
- [ ] another team member has reviewed
- [ ] tests are green
- [ ] UR test OR added to continual research plan
Reference
This issue is stale because it has been open 90 days with no activity.
https://user-guide.modernisation-platform.service.justice.gov.uk/runbooks/dr-process.html#priority-list - Priority List
https://user-guide.modernisation-platform.service.justice.gov.uk/runbooks/dr-process.html#single-account - Single account
I think @ep-93 has covered a lot of this, but I would see the following as platform components that would need to be reconstituted in the event of a region loss.
- Modernisation Platform AWS account
  - Contains components necessary to run the platform
  - Some of these are either global, or replicated to different regions
  - KMS keys, S3 buckets, DynamoDB, used for Terraform state
- `core-logging-production` AWS account
  - S3 buckets used for storing logs & attendant KMS keys already appear to be replicated into `eu-west-1`
- `core-network-services-production` AWS account
  - Transit Gateway would need to be created in new region
  - external-inspection VPC & Network Firewall for communication with internal MOJ networks
  - non_live_data / live_data VPCs & Network Firewalls for communication between platform & internet
- `core-shared-services-production` AWS account
  - non_live_data / live_data VPCs used for EC2 Image Builder pipelines, and for some shared customer infrastructure (Active Directory controllers).
  - AMI images created through pipelines & used by infrastructure will be region-bound.
  - instance scheduler lambda (not critical, but has region-specific image URI).
  - shared KMS keys offered to customers
- `core-vpc-$environment` AWS accounts
  - `$environment`-`$business_unit` VPCs (required to share resources out to member accounts)
  - AWS Backups provided through baselines; are these meant to allow customers to recreate instances? If we duplicate backups to a separate region, can we be sure they'll work for customers?
These two aren't platform components specifically, but would also need consideration:
- MOJ Master account
  - AWS Organizations Service Control Policies restricting use of AWS outside of `eu-west-2`
- MOJ Official (Production) account
  - Contains Transit Gateways. We'd need to peer a new TGW with an existing TGW in this account for internal connectivity with the rest of the MOJ
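As a rough illustration of the Transit Gateway points above, the following is a minimal Terraform sketch of creating a TGW in a recovery region and requesting a peering attachment to an existing MOJ TGW. The provider alias, the `eu-west-1` region, and all three variables are assumptions made for the example and are not taken from the platform's code.

```hcl
# Sketch only: region, names, and variables are illustrative assumptions.
provider "aws" {
  alias  = "recovery"
  region = "eu-west-1" # assumed recovery region
}

# Hypothetical inputs describing the existing MOJ Transit Gateway to peer with.
variable "moj_tgw_account_id" {
  type = string
}

variable "moj_tgw_id" {
  type = string
}

variable "moj_tgw_region" {
  type = string
}

# New Transit Gateway in the recovery region.
resource "aws_ec2_transit_gateway" "recovery" {
  provider    = aws.recovery
  description = "Recovery-region Transit Gateway (example)"
}

# Peering request from the new TGW to the existing MOJ TGW.
resource "aws_ec2_transit_gateway_peering_attachment" "moj" {
  provider                = aws.recovery
  transit_gateway_id      = aws_ec2_transit_gateway.recovery.id
  peer_account_id         = var.moj_tgw_account_id
  peer_region             = var.moj_tgw_region
  peer_transit_gateway_id = var.moj_tgw_id
}
```

The peering request would still need to be accepted on the MOJ side, and because TGW peering attachments do not propagate routes dynamically, static routes would need to be added to the route tables on both sides.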
With regard to our KMS keys, the answer here might be to look more deeply into provisioning `aws_kms_replica_key` resources, as also discussed here.
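For reference, a minimal sketch of what a replica key could look like in Terraform. The provider alias, the `eu-west-1` region, the resource names, and the deletion window are assumptions for illustration only.

```hcl
# Sketch only: region, names, and deletion windows are illustrative assumptions.
provider "aws" {
  alias  = "replica"
  region = "eu-west-1" # assumed replica region
}

# The primary key must be created as a multi-Region key.
resource "aws_kms_key" "primary" {
  description             = "Example multi-Region primary key"
  multi_region            = true
  deletion_window_in_days = 30
}

# Replica of the primary key, created in the replica region.
resource "aws_kms_replica_key" "replica" {
  provider                = aws.replica
  description             = "Example replica of the primary key"
  primary_key_arn         = aws_kms_key.primary.arn
  deletion_window_in_days = 30
}
```

One caveat worth checking before relying on this: an existing single-Region key cannot be converted to a multi-Region key in place, so existing keys would likely need to be replaced and any data encrypted with them re-encrypted.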
With regard to AWS Backup, we can also copy backups into a separate region:
    resource "aws_backup_plan" "replica" {
      ...
      rule {
        copy_action {
          destination_vault_arn = "arn:aws:backup:*:*:backup-vault:replica"
          lifecycle {}
        }
      }
    }
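A slightly fuller sketch of the same idea, with the destination vault created in a second region and referenced by ARN rather than by a wildcard. The vault names, schedule, retention periods, and the `eu-west-1` region are all assumptions for illustration, and a default `aws` provider for the primary region is assumed to be configured elsewhere.

```hcl
# Sketch only: names, schedule, retention, and region are illustrative assumptions.
provider "aws" {
  alias  = "replica"
  region = "eu-west-1" # assumed replica region
}

# Vault in the primary region (uses the default provider).
resource "aws_backup_vault" "primary" {
  name = "example-primary"
}

# Vault in the replica region that receives the copies.
resource "aws_backup_vault" "replica" {
  provider = aws.replica
  name     = "example-replica"
}

resource "aws_backup_plan" "with_copy" {
  name = "example-with-cross-region-copy"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 3 * * ? *)"

    # Retention of the recovery point in the primary vault.
    lifecycle {
      delete_after = 30
    }

    # Copy each recovery point to the replica vault.
    copy_action {
      destination_vault_arn = aws_backup_vault.replica.arn

      # Retention of the copy in the replica vault.
      lifecycle {
        delete_after = 30
      }
    }
  }
}
```

This only covers getting copies into another region. Whether customers can actually restore from those copies is the open question raised above and would need testing.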
So I think this gives us the following runbooks in need of creation for platform components:
- [ ] Modernisation Platform account
  - [ ] AWS resources (IAM roles, accounts, secrets)
  - [ ] Resources used by Terraform (S3, DynamoDB)
- [ ] `core-logging-production` account
  - [ ] S3 bucket for logs
- [ ] `core-network-services` account
  - [ ] VPCs w/ Network Firewalls (and NAT gateways for egress)
  - [ ] Transit Gateway (and peering to MOJ TGW)
- [ ] `core-shared-services` account
  - [ ] VPCs for shared infrastructure
  - [ ] AMI builder & resources (eg, S3 bucket)
  - [ ] Instance Scheduler
  - [ ] Shared KMS keys
- [ ] `core-vpc-$environment` accounts
  - [ ] VPCs
  - [ ] RAM shares, VPC endpoints