Spike: Convert the Live load-balancer to an Application Load Balancer (ALB)

Open jasonBirchall opened this issue 2 years ago • 1 comment

Background

On 11th July 2022, users started reporting that a portion of their traffic was not responding or was timing out. The previous day (10/07/22), AWS had restarted a large number of nodes due to a “thermal incident” (TAM). The reports were traced to a network interface in the live-1 VPC that was responding successfully to only around 75% of traffic (rough estimate). This degraded all services in the Cloud Platform for approximately 6 hours and 37 minutes.

We must find a way to mitigate this in the future.

Source: https://docs.google.com/document/d/1QR31_9Ga_LdXSzgoFjiemE-jxq5sf59rKj5gAoNTU9E/edit#

Following a post-incident review, the team came up with the following action:

Use an ALB instead of our Network Load Balancer. The assumption is that we can omit a failed interface from the load balancer until it is repaired.

Proposed user journey

As a Cloud Platform developer, I want to replace the current NLB with an ALB (in place, if possible), so that we can mitigate interface failures in the future.

Approach

  • [ ] Try to swap the default NLB for an ALB in place; if that is not possible, create a new ALB.
  • [ ] Confirm that, once on an ALB, a team member can remove/delete an interface either through the console or via the CLI.
  • [ ] Demonstrate to the team in the form of a presentation.
  • [ ] Provide guidance and opinion on the next steps.
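
For the CLI check above, a minimal sketch of what "removing an interface" might look like, assuming it maps to detaching the subnet of the affected Availability Zone from the ALB (this removes the whole zone, not an individual network interface, and whether that is what the spike needs is itself something to confirm). It uses the ELBv2 `describe_load_balancers` and `set_subnets` calls as exposed by boto3; the helper names `subnets_without_zone` and `drop_zone` are made up for illustration, and the client is passed in so the logic can be exercised without touching AWS.

```python
from typing import Any, Dict, List


def subnets_without_zone(load_balancer: Dict[str, Any], failed_zone: str) -> List[str]:
    """Return the subnet IDs to keep once the failed zone is dropped.

    `load_balancer` is one entry of the "LoadBalancers" list returned by
    the ELBv2 describe_load_balancers call.
    """
    zones = load_balancer["AvailabilityZones"]
    keep = [z["SubnetId"] for z in zones if z["ZoneName"] != failed_zone]
    if len(keep) == len(zones):
        raise ValueError(f"{failed_zone} is not attached to this load balancer")
    # An ALB must stay attached to subnets in at least two Availability Zones.
    if len(keep) < 2:
        raise ValueError("an ALB needs subnets in at least two Availability Zones")
    return keep


def drop_zone(elbv2: Any, load_balancer_arn: str, failed_zone: str) -> List[str]:
    """Detach the failed zone's subnet from the ALB and return what remains.

    `elbv2` is assumed to be a boto3 "elbv2" client (or a stand-in with the
    same two methods). set_subnets replaces the full subnet list, so the
    zones to keep must all be passed.
    """
    described = elbv2.describe_load_balancers(LoadBalancerArns=[load_balancer_arn])
    keep = subnets_without_zone(described["LoadBalancers"][0], failed_zone)
    elbv2.set_subnets(LoadBalancerArn=load_balancer_arn, Subnets=keep)
    return keep
```

With real credentials this would be called as `drop_zone(boto3.client("elbv2"), "<alb-arn>", "<zone-name>")`. Re-attaching the zone after the repair would be another `set_subnets` call with the full subnet list. Whether either call is free of downtime is not shown here and still needs testing.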

Which part of the user docs does this impact

None that I can think of.

Questions / Assumptions

[assumption] The switch is possible without downtime.

[assumption] You can in fact remove an interface from an ALB.

Definition of done

  • [ ] The default ingress controller is using an ALB, not an NLB
  • [ ] All assumptions/questions have been answered.
  • [ ] The process is demonstrated to the team in the form of a presentation.
  • [ ] An opinion is given as to the next steps.

Reference

How to write good user stories

jasonBirchall, Jul 21 '22 11:07

@vijay-veeranki https://github.com/ministryofjustice/cloud-platform-terraform-alb-ingress-controller

razvan-moj, Aug 31 '22 10:08

https://docs.google.com/document/d/1TAlL2QorgwWeFMB-saSBhLZrgzEE2FFMcRIjM2ShuGY/edit#

vijay-veeranki, Sep 16 '22 10:09