cloud-platform
Spike: Convert the Live load-balancer to an Application Load Balancer (ALB)
Background
On 11 July 2022, users started reporting that a portion of their traffic was not responding or was timing out. The previous day (10 July 2022), AWS had restarted a large number of nodes due to a “thermal incident” (TAM). The new reports were traced to a network interface in the live-1 VPC that was responding successfully to only around 75% of traffic (rough estimate). This degraded all services on the Cloud Platform for approximately 6 hours and 37 minutes.
We must find a way to mitigate this in the future.
Source: https://docs.google.com/document/d/1QR31_9Ga_LdXSzgoFjiemE-jxq5sf59rKj5gAoNTU9E/edit#
Following a post-incident review, the team came up with the following action:
Use an ALB instead of our Network Load Balancer. The assumption is that we can omit a failed interface from the load balancer until it is repaired.
Proposed user journey
As a Cloud Platform developer, I want to replace the current NLB with an ALB (in place, if possible), so that we can mitigate interface failures in the future.
Approach
- [ ] Try to replace the default NLB with an ALB in place; if that is not possible, create a new ALB.
- [ ] Confirm that, once the ALB is in use, a team member can remove/delete an interface either through the console or via the CLI.
- [ ] Demonstrate to the team in the form of a presentation.
- [ ] Provide guidance and opinion on the next steps.
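One way the second step could be exercised from a script is sketched below. It rests on an assumption the spike still has to confirm: that "removing an interface" corresponds to dropping the failed Availability Zone's subnet from the ALB with the ELBv2 `SetSubnets` call. The function names, the ARN argument and the zone names are hypothetical; only `describe_load_balancers` and `set_subnets` are real boto3 calls.

```python
from typing import Dict, List


def subnets_without_zone(availability_zones: List[Dict[str, str]],
                         failed_zone: str) -> List[str]:
    """Return the subnet IDs to keep once the failed zone is dropped.

    `availability_zones` has the shape of the "AvailabilityZones" list
    returned by describe_load_balancers (ZoneName and SubnetId keys).
    """
    remaining = [az["SubnetId"] for az in availability_zones
                 if az["ZoneName"] != failed_zone]
    # An ALB normally has to stay attached to at least two zones,
    # so refuse a change that would leave fewer than two subnets.
    if len(remaining) < 2:
        raise ValueError("an ALB needs subnets in at least two Availability Zones")
    return remaining


def remove_zone_from_alb(load_balancer_arn: str, failed_zone: str) -> List[str]:
    """Hypothetical helper: detach the failed zone's subnet from the ALB."""
    import boto3  # imported lazily so the helper above works without AWS access

    elbv2 = boto3.client("elbv2")
    lb = elbv2.describe_load_balancers(
        LoadBalancerArns=[load_balancer_arn])["LoadBalancers"][0]
    keep = subnets_without_zone(lb["AvailabilityZones"], failed_zone)
    # SetSubnets replaces the full subnet list, so pass every subnet to keep.
    elbv2.set_subnets(LoadBalancerArn=load_balancer_arn, Subnets=keep)
    return keep
```

Keeping the subnet selection separate from the AWS call means the logic can be checked without touching the live load balancer. Whether this actually takes the faulty interface out of service, and whether it can be done without downtime, is exactly what the spike needs to demonstrate.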
Which part of the user docs does this impact
None that I can think of.
Questions / Assumptions
[assumption] The switch is possible without downtime.
[assumption] An interface can in fact be removed from an ALB.
Definition of done
- [ ] The default ingress controller is using an ALB, not an NLB
- [ ] All assumptions/questions have been answered.
- [ ] The process is demonstrated to the team in the form of a presentation.
- [ ] An opinion is given as to the next steps.
Reference
@vijay-veeranki https://github.com/ministryofjustice/cloud-platform-terraform-alb-ingress-controller
https://docs.google.com/document/d/1TAlL2QorgwWeFMB-saSBhLZrgzEE2FFMcRIjM2ShuGY/edit#