Mitigating Let's Encrypt Rate Limiting Issues
what
We're concerned about Let's Encrypt rate limiting. Switching our staging environment over to Let's Encrypt's staging environment is fair enough, but I'm concerned about hitting rate limits in production.
why
If Let's Encrypt rate limits us again, we could be blocked from making changes to our infrastructure, so we need a way to mitigate that. Naively, we could switch to a wildcard certificate for *.example.net and make sure every server uses a DNS name of the form server-123-123.example.net.
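A wildcard certificate only covers names one label below the wildcard, which is why the naive approach depends on flat names such as server-123-123.example.net. The sketch below illustrates that matching rule; `covered_by` is a hypothetical helper written for this note, not part of any library, and it ignores edge cases such as internationalized names.

```python
def covered_by(hostname: str, san: str) -> bool:
    """Return True if the certificate name `san` covers `hostname`.

    `san` is either an exact DNS name or a wildcard of the form "*.domain".
    The wildcard stands in for exactly one DNS label.
    """
    host = hostname.lower().rstrip(".")
    name = san.lower().rstrip(".")

    if not name.startswith("*."):
        # Exact names must match in full.
        return host == name

    suffix = name[1:]  # "*.example.net" -> ".example.net"
    if not host.endswith(suffix):
        return False

    # What the wildcard replaced: must be a single, non-empty label.
    left = host[: -len(suffix)]
    return bool(left) and "." not in left


if __name__ == "__main__":
    for host in (
        "server-123-123.example.net",
        "example.net",
        "ourapp.us-west-2.staging.example.net",
    ):
        print(host, covered_by(host, "*.example.net"))
```

Under this rule, *.example.net covers server-123-123.example.net but neither the bare example.net nor a deeper name such as ourapp.us-west-2.staging.example.net, so nested names need their own wildcard or SAN entries.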
There are a few options.
option 1
Use an ACM certificate provisioned with terraform and associated with the nginx-ingress.
https://github.com/cloudposse/terraform-aws-acm-request-certificate
Reference implementation here: https://github.com/cloudposse/terraform-root-modules/tree/master/aws/acm
Then set the ingress annotations to use this ACM certificate (e.g. with SANs for *.ourapp.us-west-2.staging.example.net and ourapp.us-west-2.staging.example.net).
AWS Service annotations
service.beta.kubernetes.io/aws-load-balancer-ssl-cert (IAM or ACM ARN) (via: https://gist.github.com/mgoodness/1a2926f3b02d8e8149c224d25cc57dc1)
These are passed to the Helm chart in the helmfile.yaml
https://github.com/cloudposse/geodesic/blob/master/rootfs/conf/kops/helmfile.yaml#L556-L557
option 2
Use a different operational domain for production to reduce sharing across stages. E.g. treat example.net as a staging domain and example.co as the production operations domain. This is what another one of our customers does. They incidentally use ACM certs as well, but only because we started this journey before kube-lego existed.
other considerations
The likelihood of getting rate limited in production is small for a few reasons:
- Very few new services are launched
- Namespaces are seldom, if ever, destroyed
- Certificates are still long-lived, so requests to the Let's Encrypt API are few and far between. They can be renewed well before the 90-day expiry, so rate limits would have to stay in effect for several days before a renewal ultimately fails or times out.
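The renewal argument above can be made concrete with a little date arithmetic. This is a minimal sketch: the 90-day lifetime comes from the text, while the 30-day renewal lead time is an assumption chosen for illustration (the actual value depends on how the certificate client is configured), and both function names are hypothetical.

```python
from datetime import date, timedelta

CERT_LIFETIME = timedelta(days=90)  # certificate lifetime, per the text above
RENEW_BEFORE = timedelta(days=30)   # ASSUMPTION: renewal starts 30 days before expiry


def renewal_window(issued: date,
                   lifetime: timedelta = CERT_LIFETIME,
                   renew_before: timedelta = RENEW_BEFORE):
    """Return (first_renewal_attempt, expiry) for a certificate issued on `issued`."""
    expires = issued + lifetime
    return expires - renew_before, expires


def survives_rate_limit(issued: date, blocked_days: int) -> bool:
    """True if a renewal can still succeed before expiry when every attempt
    is rejected for `blocked_days` days starting at the first attempt."""
    first_attempt, expires = renewal_window(issued)
    return first_attempt + timedelta(days=blocked_days) < expires


if __name__ == "__main__":
    issued = date(2019, 1, 1)
    print(renewal_window(issued))          # first attempt and expiry dates
    print(survives_rate_limit(issued, 7))  # a week-long block
    print(survives_rate_limit(issued, 30)) # a block spanning the whole window
```

With these assumed numbers, a rate limit lasting a week still leaves about three weeks to retry, and only a block spanning the entire renewal window leads to an expired certificate.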
The reason you're at elevated risk in staging is the large number of publicly exposed services that result from running "unlimited staging environments". Moving staging to the Let's Encrypt staging environment reduces the risk of inducing rate limits in production. Using an entirely separate domain in production mitigates the impact even further.