kubernetes-letsencrypt

Google Cloud DNS challenges fail sometimes

Open tazjin opened this issue 8 years ago • 5 comments

As mentioned in https://github.com/tazjin/kubernetes-letsencrypt/commit/4e3bbd6b32bafd2e6e83f44f329792cb87099172 and the comment in the code, Cloud DNS updates sometimes have not fully propagated when they are marked as "DONE" and even when the DNS observer sees the change in all nameservers.

Presumably this is some eventual consistency deal on Google's side. It is "solved" for now with an artificial wait timer, but long-term we should figure out what causes it, whether it is documented anywhere, and how to deal with it better.
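For illustration, the "observe all nameservers, then wait a bit longer" workaround could be sketched roughly as below. This is not the controller's actual code: `wait_for_txt_record`, the `lookup_txt` callback and every default value are hypothetical names and numbers chosen for the sketch, and the DNS lookup is injected so the example stays self-contained.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_txt_record(
    nameservers: Iterable[str],
    lookup_txt: Callable[[str], Set[str]],  # hypothetical: TXT values seen by one nameserver
    expected: str,
    settle_seconds: float = 20.0,   # the "artificial wait timer" (assumed value)
    poll_seconds: float = 2.0,
    timeout_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once every nameserver serves `expected`, after an extra settle delay."""
    servers = list(nameservers)
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        # Every authoritative nameserver must already return the challenge value.
        if all(expected in lookup_txt(ns) for ns in servers):
            # Observed propagation is apparently not sufficient on Cloud DNS,
            # so wait a fixed extra period before answering the challenge.
            sleep(settle_seconds)
            return True
        sleep(poll_seconds)
    return False
```

The settle delay is applied only after all nameservers agree, so raising it (for example only for Cloud DNS) would not slow down the polling phase itself.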

tazjin avatar Sep 18 '16 21:09 tazjin

Still see this now and again on GCP DNS. Any chance of upping the wait timer? Maybe just for GCP?

ahume avatar Apr 13 '17 13:04 ahume

Hm, thanks for pinging! I've also seen this a few times on a GCP cluster, though it eventually sorts itself out. Still worth investigating whether I've misread the docs about what DONE means and if there's anything else that can be done instead.

Todo:

  • [x] Check Google Cloud DNS docs again for obvious mistakes
  • [ ] Possible to re-trigger challenge validation?
  • [x] Up wait-timer if above fails

Probably not doing this before the weekend due to the Easter holidays :)

tazjin avatar Apr 13 '17 14:04 tazjin

Re-triggering challenge validation seems like a nice idea.

In my experience, once you've had one failure here, any subsequent DNS updates and challenges also fail. I normally have to stop the controller, delete any DNS records, and start up again from scratch to get it to work. If it could just retry the challenge a few times, that might well solve it.

Or if not, wait another 20/30 seconds. :)
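A rough sketch of that idea, retrying a few times with a clean slate and a growing extra wait between attempts, might look like this. The names `validate_with_retries`, `attempt_challenge` and `cleanup` are hypothetical stand-ins for whatever the controller does to create the record, answer the challenge and delete stale DNS records; the attempt count and wait times are assumptions.

```python
import time
from typing import Callable


def validate_with_retries(
    attempt_challenge: Callable[[], bool],  # hypothetical: full DNS update + challenge round
    cleanup: Callable[[], None],            # hypothetical: delete leftover TXT records
    max_attempts: int = 3,
    extra_wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run the challenge up to `max_attempts` times, cleaning up after each failure."""
    for attempt in range(1, max_attempts + 1):
        if attempt_challenge():
            return True
        # Mirror the manual fix: remove stale records before starting over.
        cleanup()
        if attempt < max_attempts:
            # Wait longer after each failed attempt (30 s, 60 s, ...).
            sleep(extra_wait_seconds * attempt)
    return False
```

Each attempt is meant to be a complete round with a new record, since in the failure case described above simply repeating the same step did not help.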

ahume avatar Apr 13 '17 15:04 ahume

I've read through the docs again and as far as I can tell, DONE should mean done (but doesn't).

I'll up the wait timer and investigate the Let's Encrypt API to see if the validation can be triggered multiple times.

tazjin avatar Apr 19 '17 14:04 tazjin

From the ACME spec:

> Clients SHOULD NOT respond to challenges until they believe that the server’s queries will succeed. If a server’s initial validation query fails, the server SHOULD retry the query after some time. While the server is still trying, the status of the challenge remains “pending”; it is only marked “invalid” once the server has given up.

I believe that Boulder (the Let's Encrypt server) currently doesn't retry DNS challenges at all; it immediately marks them as invalid.
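On the client side, the status handling described in the spec could be sketched as follows. `poll_challenge` and `fetch_status` are hypothetical names, and the poll count and interval are assumptions; only the status strings "pending", "valid" and "invalid" come from the spec text. If the server really goes straight to "invalid", the caller would presumably have to start over with a new challenge rather than re-trigger the old one.

```python
import time
from typing import Callable


def poll_challenge(
    fetch_status: Callable[[], str],  # hypothetical: returns the challenge's current status
    max_polls: int = 10,
    poll_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the challenge reaches a final status or the poll budget runs out."""
    for _ in range(max_polls):
        status = fetch_status()
        # "valid" and "invalid" are final; anything else means the server is still working.
        if status in ("valid", "invalid"):
            return status
        sleep(poll_seconds)
    # Budget exhausted without a final answer.
    return "pending"
```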

tazjin avatar May 04 '17 21:05 tazjin