Mitigating Let's Encrypt Rate Limiting Issues

Open osterman opened this issue 7 years ago • 1 comments

what

We're concerned about Let's Encrypt rate limiting. Switching our staging environment over to Let's Encrypt's staging environment is fair enough, but I'm still concerned about rate limits in production.

why

It basically means we could be blocked from making changes to our infrastructure if Let's Encrypt rate limits us again, so we need some solution to that. Naively, we could switch to a wildcard cert for *.example.net and make sure all of the servers use DNS names such as server-123-123.example.net.

Jul 14 '18 00:07 osterman

There are a few options.

option 1

Use an ACM certificate provisioned with terraform and associated with the nginx-ingress.

https://github.com/cloudposse/terraform-aws-acm-request-certificate

Reference implementation here: https://github.com/cloudposse/terraform-root-modules/tree/master/aws/acm

Then set the annotations on the ingress controller's Service to use this ACM certificate (e.g. one issued with SANs for *.ourapp.us-west-2.staging.example.net and ourapp.us-west-2.staging.example.net).

AWS Service annotations

service.beta.kubernetes.io/aws-load-balancer-ssl-cert (IAM or ACM ARN) (via: https://gist.github.com/mgoodness/1a2926f3b02d8e8149c224d25cc57dc1)

These are passed to the Helm chart in helmfile.yaml: https://github.com/cloudposse/geodesic/blob/master/rootfs/conf/kops/helmfile.yaml#L556-L557
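As a rough illustration of what those annotations look like once assembled, here is a minimal Python sketch. The three annotation keys are the ones read by the in-tree Kubernetes AWS cloud provider; the helper name `elb_tls_annotations`, the choice of `http` as the backend protocol, the `https` port name, and the placeholder ARN are all assumptions made for the example, not values taken from the linked helmfile.

```python
import json

# Annotation keys read by the in-tree Kubernetes AWS cloud provider.
SSL_CERT = "service.beta.kubernetes.io/aws-load-balancer-ssl-cert"
BACKEND_PROTOCOL = "service.beta.kubernetes.io/aws-load-balancer-backend-protocol"
SSL_PORTS = "service.beta.kubernetes.io/aws-load-balancer-ssl-ports"


def elb_tls_annotations(cert_arn, ssl_ports="https"):
    """Build Service annotations that terminate TLS on the ELB.

    Hypothetical helper: it assumes the ELB speaks plain HTTP to the
    ingress controller and that the Service names its TLS port "https".
    """
    if not cert_arn.startswith(("arn:aws:acm:", "arn:aws:iam:")):
        raise ValueError("expected an ACM or IAM certificate ARN: %r" % cert_arn)
    return {
        SSL_CERT: cert_arn,
        BACKEND_PROTOCOL: "http",
        SSL_PORTS: ssl_ports,
    }


# Placeholder ARN for illustration only; not a real certificate.
EXAMPLE_ARN = (
    "arn:aws:acm:us-west-2:123456789012:"
    "certificate/00000000-0000-0000-0000-000000000000"
)

if __name__ == "__main__":
    # JSON is also valid YAML, so the output can be pasted into a values file.
    print(json.dumps(elb_tls_annotations(EXAMPLE_ARN), indent=2))
```

The ARN itself would come from the Terraform module's output rather than being hard-coded.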

option 2

Use a different operational domain for production to reduce sharing across stages. E.g. treat example.net as a staging domain and example.co as the production operations domain. This is what another one of our customers does. They incidentally use ACM certs as well, but only because we started this journey before kube-lego existed.

other considerations

The likelihood of getting rate limited in production is small for a few reasons:

Very few new services are launched
Namespaces are seldom, if ever, destroyed
Certificates are still long-lived, so requests to the API are few and far between. They can be renewed well before the 90-day cutoff, so rate limits would have to stay in effect for several days before a renewal ultimately failed or timed out.
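The renewal argument above can be checked with a little date arithmetic. The sketch below assumes renewal is first attempted 30 days before expiry (a common client default, not something stated in this thread) and models a rate-limit outage as a single blocked date range; the function names are hypothetical.

```python
from datetime import date, timedelta

CERT_LIFETIME = timedelta(days=90)  # Let's Encrypt certificate validity


def earliest_renewal(issued_on, outage_start, outage_days, renew_before_days=30):
    """Return (date the renewal can succeed, date the certificate expires).

    Assumes the client first tries to renew `renew_before_days` before
    expiry and keeps retrying until the rate-limit outage ends.
    """
    expires_on = issued_on + CERT_LIFETIME
    first_attempt = expires_on - timedelta(days=renew_before_days)
    outage_end = outage_start + timedelta(days=outage_days)
    if outage_start <= first_attempt < outage_end:
        # The first attempt is blocked; the next chance is when the outage ends.
        return outage_end, expires_on
    return first_attempt, expires_on


def survives_outage(issued_on, outage_start, outage_days, renew_before_days=30):
    """True if the certificate is renewed before it expires."""
    renewed_on, expires_on = earliest_renewal(
        issued_on, outage_start, outage_days, renew_before_days
    )
    return renewed_on < expires_on


if __name__ == "__main__":
    issued = date(2018, 7, 14)
    for days in (7, 40):
        ok = survives_outage(issued, date(2018, 9, 10), days)
        print("%d-day outage survived: %s" % (days, ok))
```

Under these assumptions a week-long block still leaves more than three weeks of headroom; only an outage longer than the renewal window causes an expiry.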

The reason you're at elevated risk in staging is the large number of publicly exposed services that result from running "unlimited staging environments". Moving staging to Let's Encrypt's staging environment reduces the risk of inducing rate limits in production. Using an entirely separate domain in production mitigates the impact even further.
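To see why a separate production domain isolates the impact, here is a small sketch that counts issuances per registered domain over a trailing window. It assumes a limit of 50 certificates per registered domain per 7 days, which should be checked against Let's Encrypt's current documentation, and it derives the registered domain naively from the last two labels instead of the Public Suffix List. All names and hostnames are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def registered_domain(hostname):
    """Naive registered domain: the last two DNS labels.

    Real clients consult the Public Suffix List; this shortcut is only
    adequate for simple suffixes such as .net or .co.
    """
    return ".".join(hostname.lower().rstrip(".").split(".")[-2:])


def issuance_counts(issuances, now, window=timedelta(days=7)):
    """Count issuances per registered domain within the trailing window."""
    counts = defaultdict(int)
    for hostname, issued_at in issuances:
        if now - window < issued_at <= now:
            counts[registered_domain(hostname)] += 1
    return dict(counts)


def over_limit(counts, limit=50):
    """Return the registered domains that exceed the assumed limit."""
    return sorted(domain for domain, n in counts.items() if n > limit)


if __name__ == "__main__":
    now = datetime(2018, 7, 14)
    yesterday = now - timedelta(days=1)
    # 60 short-lived staging environments share example.net ...
    issuances = [("env-%d.example.net" % i, yesterday) for i in range(60)]
    # ... while production lives on its own domain.
    issuances += [("api.example.co", yesterday), ("www.example.co", yesterday)]
    counts = issuance_counts(issuances, now)
    print(counts, over_limit(counts))
```

In this example only the staging domain exhausts its budget, while the production domain's count is untouched by staging churn.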

Jul 14 '18 00:07 osterman
