
Issues creating certificates for subdomain with route53

Open armsby opened this issue 5 years ago • 23 comments

I have been trying to create a certificate using Let's Encrypt and Route53. The certificate I'm trying to create is for 'server.sub.domain.com'. When using Route53, I get an error saying that it cannot find the hosted zone ID for sub.domain.com. I believe that is a bug, as the domain it should be looking for is domain.com, which does exist; there are no issues creating certificates for that domain.

I have also tested it with Cloudflare for another domain and that works perfectly, so I believe the problem is in the API call to Route53.

armsby avatar Nov 12 '19 12:11 armsby

It's happening to me as well with v3.7.0.

MichaelMure avatar Jun 02 '20 14:06 MichaelMure

After some debugging, it looks to me that what is happening is:

  • the code flow ends up in route53.DNSProvider.Present() at the start of the challenge
  • it calls d.getHostedZoneID(fqdn) to figure out the name of the hosted zone
  • the call graph continues to dns01.fetchSoaByFqdn(), which walks up the domain with DNS queries to see where there is a SOA record. For example, for foo.bar.example.org, it will query foo.bar.example.org, then bar.example.org, then example.org, which should have this SOA record.
  • the problem happens when a DNS query in dns01.fetchSoaByFqdn() has a temporary failure (say, a timeout). This error is not handled there; the function just skips that node in the domain.
  • if this failure happens at the domain that should have the SOA record (example.org), the function will end up returning org instead of example.org
  • later, the AWS SDK call to find the Route53 hosted zone by name (ListHostedZonesByName) will be called with org instead of example.org and fail
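
The steps above can be illustrated with a minimal, self-contained sketch. It is not lego's code: lookupSOA, findZoneSkippingErrors, findZoneStrict and flakyResolver are hypothetical names, and the DNS query is simulated. It only shows how swallowing a failed query at example.org makes the walk return org, while propagating the error does not.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// lookupSOA is a hypothetical stand-in for one DNS SOA query.
// It reports whether the name has a SOA record.
type lookupSOA func(name string) (bool, error)

// errTimeout simulates a transient DNS failure.
var errTimeout = errors.New("i/o timeout")

// parents lists fqdn followed by each of its parent domains.
func parents(fqdn string) []string {
	labels := strings.Split(strings.TrimSuffix(fqdn, "."), ".")
	names := make([]string, 0, len(labels))
	for i := range labels {
		names = append(names, strings.Join(labels[i:], ".")+".")
	}
	return names
}

// findZoneSkippingErrors models the behaviour described in the thread:
// a failed query is treated as "no SOA here" and the walk moves on.
func findZoneSkippingErrors(fqdn string, lookup lookupSOA) (string, error) {
	for _, name := range parents(fqdn) {
		found, err := lookup(name)
		if err != nil {
			continue // the error is silently dropped
		}
		if found {
			return name, nil
		}
	}
	return "", fmt.Errorf("no SOA record found for %s", fqdn)
}

// findZoneStrict is one possible alternative: stop at the first error
// instead of guessing a zone further up the tree.
func findZoneStrict(fqdn string, lookup lookupSOA) (string, error) {
	for _, name := range parents(fqdn) {
		found, err := lookup(name)
		if err != nil {
			return "", fmt.Errorf("SOA lookup for %s: %w", name, err)
		}
		if found {
			return name, nil
		}
	}
	return "", fmt.Errorf("no SOA record found for %s", fqdn)
}

// flakyResolver simulates a timeout exactly at example.org.
func flakyResolver(name string) (bool, error) {
	switch name {
	case "example.org.":
		return false, errTimeout
	case "org.":
		return true, nil
	default:
		return false, nil
	}
}

func main() {
	zone, err := findZoneSkippingErrors("foo.bar.example.org.", flakyResolver)
	fmt.Println("skipping errors:", zone, err)

	zone, err = findZoneStrict("foo.bar.example.org.", flakyResolver)
	fmt.Println("strict:", zone, err)
}
```

With the simulated resolver, the first variant hands a wrong zone name to whatever comes next (here, the Route53 lookup), whereas the strict variant surfaces the timeout itself.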

MichaelMure avatar Jun 02 '20 16:06 MichaelMure

So to me this is not a Route53 provider failure, this is a dns01 one.

MichaelMure avatar Jun 02 '20 16:06 MichaelMure

If you have an intermittent timeout, I think you should check your network and its configuration, and the nameservers that you are using.

ldez avatar Jun 02 '20 20:06 ldez

You can also simply configure the DNS timeout.

--dns-timeout

https://go-acme.github.io/lego/usage/cli/#usage

ldez avatar Jun 02 '20 20:06 ldez

It does not look like a timeout issue; it only happens when the host name is on a subdomain. I am updating certificates for the main domain before and after the failures without any problem, and it is only the lego client that fails. I have no issues when using https://github.com/acmesh-official/acme.sh

armsby avatar Jun 03 '20 06:06 armsby

If you have an intermittent timeout, I think you should check your network and its configuration, and the nameservers that you are using.

Sure, but that doesn't change the fact that fetchSoaByFqdn doesn't handle this kind of failure well. This is especially a problem because, if I'm not mistaken, DNS runs over UDP, that is, without any guarantee of packet delivery.

Here is a screenshot where one of those failures happens:

Capture-20200603134905-1619x1284

In that case, the node currently being checked is silently dropped and the function can return an incorrect result, which later cascades into a bigger problem (complete failure of the certificate issuance).

MichaelMure avatar Jun 03 '20 11:06 MichaelMure

@armsby I got the exact same error message as you; that's even how I found this issue. The root cause might be something other than a timeout, but if an error happens when doing a DNS query, you can end up with this final error.

MichaelMure avatar Jun 03 '20 12:06 MichaelMure

@MichaelMure so your problem is a timeout, so you can change the dnsTimeout:

client.Challenge.SetDNS01Provider(provider, dns01.AddDNSTimeout(30*time.Second))

or

--dns-timeout

ldez avatar Jun 03 '20 12:06 ldez

I understand that but that's only a band-aid on this problem. Networking is unreliable by nature, especially UDP. A DNS request can fail for different reasons and the code doing those requests should handle those errors properly if possible.

MichaelMure avatar Jun 03 '20 12:06 MichaelMure

For me, the best way to handle timeout errors is to configure dnsTimeout: this option exists only for that, it's not a band-aid.

ldez avatar Jun 03 '20 12:06 ldez

What if the UDP packet simply gets lost or dropped somewhere on an unreliable connection? No amount of timeout will fix that, and it will still show up as a timeout X minutes later.

MichaelMure avatar Jun 03 '20 12:06 MichaelMure

https://github.com/go-acme/lego/blob/1a82effaaac7f32b53b9920455a477b3364c2174/challenge/dns01/nameserver.go#L255-L266

https://github.com/go-acme/lego/blob/1a82effaaac7f32b53b9920455a477b3364c2174/challenge/dns01/nameserver.go#L259-L263

ldez avatar Jun 03 '20 12:06 ldez

Note: I certainly don't want to start an argument, and as a free software maintainer myself I know that sometimes people get ... inconsiderate. But we should be able to agree on how the code behaves.

MichaelMure avatar Jun 03 '20 12:06 MichaelMure

My understanding of the code section you linked is that a TCP DNS query will be done as a fallback if the UDP reply is too big. But that implies having a valid UDP response so that doesn't handle a packet loss.

edit: this happens when the reply is > 512 bytes: https://serverfault.com/questions/587625/why-dns-through-udp-has-a-512-bytes-limit
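
To make that point concrete, here is a simplified model of a "UDP first, TCP on truncation" query. It is a sketch with a simulated transport, not the lego code linked above: response, exchange, queryWithFallback and errNoReply are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// response is a hypothetical, stripped-down DNS reply.
type response struct {
	Truncated bool
	Answer    string
}

// exchange is a hypothetical stand-in for sending one query over
// the given network ("udp" or "tcp").
type exchange func(network string) (*response, error)

// errNoReply simulates a lost UDP packet that shows up as a timeout.
var errNoReply = errors.New("i/o timeout")

// queryWithFallback retries over TCP only when a UDP reply arrived
// and was marked as truncated. A lost packet never reaches that branch.
func queryWithFallback(ex exchange) (*response, error) {
	resp, err := ex("udp")
	if err != nil {
		return nil, err // no reply at all: nothing triggers the fallback
	}
	if resp.Truncated {
		return ex("tcp")
	}
	return resp, nil
}

func main() {
	// Case 1: the UDP reply is truncated, so TCP is used.
	truncated := func(network string) (*response, error) {
		if network == "udp" {
			return &response{Truncated: true}, nil
		}
		return &response{Answer: "full answer over tcp"}, nil
	}
	resp, err := queryWithFallback(truncated)
	fmt.Println(resp, err)

	// Case 2: the UDP packet is lost, so the caller only sees the error.
	lost := func(network string) (*response, error) {
		return nil, errNoReply
	}
	resp, err = queryWithFallback(lost)
	fmt.Println(resp, err)
}
```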

MichaelMure avatar Jun 03 '20 12:06 MichaelMure

Yes, it is only a fallback (I know what Truncated means), but it's not a simple DNS call.

Otherwise, creating a blind fix without any information to reproduce the issue does not seem to me a good way to proceed. I can create a retry system, but I need to understand why (currently, UDP by itself is not enough of a reason for me).

ldez avatar Jun 03 '20 13:06 ldez

Ha I see.

Well, I do not know why this particular DNS query fails so often for me; I have an otherwise reliable internet connection. Maybe it's because the certificates I'm trying to generate have a lot of nodes (they are in the form of *.foo.bar.fuu.boo.example.org)? Or maybe I'm just more exposed to this problem because I generate a bunch of those certs in a row.

In any case, the dns.Client.Exchange() function's doc states that:

// Exchange does not retry a failed query, nor will it fall back to TCP in
// case of truncation.

To me it implies that the possible failure is left to the caller to handle.
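
As an illustration of what handling it in the caller could look like, here is a minimal retry wrapper. It is only a sketch under assumptions: queryFunc, withRetry and errLost are hypothetical names, the query is simulated, and this is not the approach of any particular fix.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// queryFunc is a hypothetical stand-in for one DNS exchange.
type queryFunc func() (string, error)

// errLost simulates a query whose packet was lost (seen as a timeout).
var errLost = errors.New("i/o timeout")

// withRetry calls query up to attempts times, sleeping for delay
// between failed attempts, and returns the first successful answer.
func withRetry(attempts int, delay time.Duration, query queryFunc) (string, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		answer, err := query()
		if err == nil {
			return answer, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return "", fmt.Errorf("query failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	// Simulated query: the first two attempts are lost, the third succeeds.
	flaky := func() (string, error) {
		calls++
		if calls < 3 {
			return "", errLost
		}
		return "SOA example.org.", nil
	}

	answer, err := withRetry(3, 10*time.Millisecond, flaky)
	fmt.Println(answer, err, calls)
}
```

A real implementation would probably also want to distinguish transient errors from definitive answers, so that only the former are retried.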

MichaelMure avatar Jun 03 '20 13:06 MichaelMure

I can of course test whatever solution you come up with and see if that fixes the problem.

MichaelMure avatar Jun 03 '20 13:06 MichaelMure

I will try to create a retry system.

ldez avatar Jun 03 '20 15:06 ldez

Thank you :)

MichaelMure avatar Jun 03 '20 15:06 MichaelMure

@MichaelMure could you try https://github.com/go-acme/lego/pull/1180 ?

ldez avatar Jun 03 '20 20:06 ldez

I'll give it a try tomorrow. That looks like a good solution.

MichaelMure avatar Jun 03 '20 22:06 MichaelMure

I'm working from home today and I just don't get any timeouts from here. I'll try again from that other place, which apparently has less than optimal networking.

MichaelMure avatar Jun 04 '20 13:06 MichaelMure