lego
Issues creating certificates for subdomain with route53
I have been trying to create a certificate using Let's Encrypt and Route53. The certificate I'm trying to create is for 'server.sub.domain.com'. When using route53, I get an error saying that it cannot find the hosted zone ID for sub.domain.com. I believe that is a bug, as the domain it should be looking for is domain.com, which does exist; there are no issues creating certificates for that domain.
I have also tested it with Cloudflare for another domain and that works perfectly, so I believe the problem is in the API call to Route53.
It's happening to me as well with v3.7.0.
After some debugging, it looks to me that what is happening is:
- code flow somehow ends up in `route53.DNSProvider.Present()` at the start of the challenge
- call to `d.getHostedZoneID(fqdn)` to figure out what the name of the hosted zone is
- the call graph continues to `dns01.fetchSoaByFqdn()`, which performs recursive DNS calls to see if there is a SOA record. For example, for `foo.bar.example.org`, it will query `foo.bar.example.org`, then `bar.example.org`, then `example.org`, which should have this SOA record.
- the problem happens when a DNS query in `dns01.fetchSoaByFqdn()` has a temporary failure (say, a timeout). This error is not handled there; it just skips the node in the domain.
- if this failure happens at the domain that should have the SOA record (`example.org`), the function will end up returning `org` instead of `example.org`
- later, the AWS SDK call to find the Route53 hosted zone by name (`ListHostedZonesByName`) will be called with `org` instead of `example.org` and fail
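To make the failure mode above concrete, here is a minimal sketch of a SOA walk that skips failed queries. It is not lego's code: `lookupSOA`, `findZoneSkippingErrors`, and `fakeResolver` are hypothetical names, and the resolver is faked in memory so that the example runs without any network access.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// lookupSOA is a hypothetical stand-in for a single SOA query. It reports
// whether name owns a SOA record, or returns an error for a failed query.
type lookupSOA func(name string) (bool, error)

// findZoneSkippingErrors mimics the behaviour described above: a failed
// query is treated like "no SOA here" and the walk moves up one label.
func findZoneSkippingErrors(fqdn string, lookup lookupSOA) (string, error) {
	labels := strings.Split(strings.TrimSuffix(fqdn, "."), ".")
	for i := range labels {
		name := strings.Join(labels[i:], ".") + "."
		found, err := lookup(name)
		if err != nil {
			continue // the error is dropped and the node is skipped
		}
		if found {
			return name, nil
		}
	}
	return "", errors.New("no SOA record found for " + fqdn)
}

// fakeResolver returns a lookup in which example.org. and org. both own a
// SOA record, and queries for any name listed in failing time out.
func fakeResolver(failing ...string) lookupSOA {
	zones := map[string]bool{"example.org.": true, "org.": true}
	broken := map[string]bool{}
	for _, name := range failing {
		broken[name] = true
	}
	return func(name string) (bool, error) {
		if broken[name] {
			return false, errors.New("i/o timeout")
		}
		return zones[name], nil
	}
}

func main() {
	fqdn := "foo.bar.example.org."

	zone, _ := findZoneSkippingErrors(fqdn, fakeResolver())
	fmt.Println(zone) // example.org.

	// A single failed query at example.org. makes the walk return the parent.
	zone, _ = findZoneSkippingErrors(fqdn, fakeResolver("example.org."))
	fmt.Println(zone) // org.
}
```

In this sketch, one dropped query is enough to turn the right answer into `org.`, which is the name that would then be handed to the hosted zone lookup.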
So to me this is not a Route53 provider failure, this is a dns01 one.
If you have an intermittent timeout, I think you should check your network and its configuration, and the nameservers that you are using.
You can also simply configure the DNS timeout.
--dns-timeout
https://go-acme.github.io/lego/usage/cli/#usage
It does not look like a timeout issue: it only happens when the host name is on a subdomain. I am updating certificates for the main domain on both sides of the failures, and only the lego client fails. I have no issues when using https://github.com/acmesh-official/acme.sh
> If you have an intermittent timeout, I think you should check your network and its configuration, and the nameservers that you are using.
Sure, but that doesn't change the fact that fetchSoaByFqdn doesn't handle this kind of failure well. This is especially a problem because, if I'm not mistaken, DNS runs over UDP, that is, without any guarantee of packet delivery.
Here is a screenshot where one of those failures happens:

In that case, the node currently being checked is silently dropped and the function can return an incorrect result, which later cascades into a bigger problem (complete failure of the certificate issuance).
@armsby I got the exact same error message as you, that's even how I found this issue. The root cause might be something other than a timeout, but if an error happens during a DNS query, you can end up with this final error.
@MichaelMure so your problem is a timeout, so you can change the dnsTimeout:
client.Challenge.SetDNS01Provider(provider, dns01.AddDNSTimeout(30*time.Second))
or
--dns-timeout
I understand that but that's only a band-aid on this problem. Networking is unreliable by nature, especially UDP. A DNS request can fail for different reasons and the code doing those requests should handle those errors properly if possible.
For me, the best way to handle a timeout error is to configure dnsTimeout: this option exists for exactly that, it's not a band-aid.
What if the UDP packet simply gets lost or dropped somewhere on an unreliable connection? No timeout value will fix that, and it will still show up as a timeout X minutes later.
https://github.com/go-acme/lego/blob/1a82effaaac7f32b53b9920455a477b3364c2174/challenge/dns01/nameserver.go#L255-L266
https://github.com/go-acme/lego/blob/1a82effaaac7f32b53b9920455a477b3364c2174/challenge/dns01/nameserver.go#L259-L263
Note: I certainly don't want to start an argument, and as a free software maintainer myself I know that sometimes people get ... inconsiderate. But we should be able to agree on how the code behaves.
My understanding of the code section you linked is that a TCP DNS query will be done as a fallback if the UDP reply is too big. But that implies having a valid UDP response, so it doesn't handle packet loss.
edit: this happens when the reply is > 512 bytes: https://serverfault.com/questions/587625/why-dns-through-udp-has-a-512-bytes-limit
Yes, it is only a fallback (I know what Truncated means), but it's not a simple DNS call.
Otherwise, creating a blind fix without any information to reproduce the issue does not seem to me a good way to proceed. I can create a retry system, but I need to understand why it fails (currently, UDP by itself is not enough of an explanation for me).
Ha I see.
Well, I do not know why this particular DNS query fails so often for me; I have an otherwise reliable internet connection. Maybe it's because the certificates I'm trying to generate have a lot of nodes (they are in the form of *.foo.bar.fuu.boo.example.org)? Or maybe I'm just more exposed to this problem because I generate a bunch of those certs in a row.
In any case, the dns.Client.Exchange() function's doc states that:
// Exchange does not retry a failed query, nor will it fall back to TCP in
// case of truncation.
To me, it implies that handling a possible failure is left to the caller.
I can of course test whatever solution you come up with and see if that fixes the problem.
I will try to create a retry system.
Thank you :)
@MichaelMure could you try https://github.com/go-acme/lego/pull/1180 ?
I'll give it a try tomorrow. That looks like a good solution.
I'm working from home today and I just don't get any timeouts from here. I'll try again from this other place that apparently has less than optimal networking.