kubernetes-letsencrypt
Google Cloud DNS challenges fail sometimes
As mentioned in https://github.com/tazjin/kubernetes-letsencrypt/commit/4e3bbd6b32bafd2e6e83f44f329792cb87099172 and the comment in the code, Cloud DNS updates sometimes have not fully propagated when they are marked as "DONE" and even when the DNS observer sees the change in all nameservers.
Presumably this is some eventual consistency deal on Google's side. It is "solved" for now with an artificial wait timer, but long-term we should figure out what causes it, if there's documentation about it and how to deal with it better.
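For illustration, the "check every nameserver, then wait a bit longer" workaround described above could be sketched roughly as follows. This is not the project's actual code: `wait_for_txt_record`, the injected `lookup` callback (which would wrap whatever DNS client is in use), and the default timings are all assumptions made for this example.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_txt_record(
    lookup: Callable[[str, str], Set[str]],  # hypothetical: (nameserver, name) -> TXT values
    nameservers: Iterable[str],
    record_name: str,
    expected_value: str,
    settle_seconds: float = 30.0,   # assumed extra wait after the record looks propagated
    poll_interval: float = 5.0,     # assumed delay between polling rounds
    timeout: float = 300.0,         # assumed overall deadline
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once every nameserver serves the value and the settle wait has passed."""
    servers = list(nameservers)
    if not servers:
        raise ValueError("at least one nameserver is required")

    deadline = clock() + timeout
    while clock() < deadline:
        # Only proceed when *all* authoritative nameservers return the challenge value.
        if all(expected_value in lookup(ns, record_name) for ns in servers):
            # The artificial wait: the record being visible has not been enough in practice.
            sleep(settle_seconds)
            return True
        sleep(poll_interval)
    return False
```

Passing `sleep` and `clock` in as parameters keeps the timing testable, and making the settle delay a parameter would allow a longer wait for Cloud DNS only.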
Still see this now and again on GCP DNS. Any chance of upping the wait timer? Maybe just for GCP?
Hm, thanks for pinging! I've also seen this a few times on a GCP cluster, though it eventually sorts itself out. Still worth investigating whether I've misread the docs about what `DONE` means and if there's anything else that can be done instead.
Todo:
- [x] Check Google Cloud DNS docs again for obvious mistakes
- [ ] Possible to re-trigger challenge validation?
- [x] Up wait-timer if above fails
Probably not doing this before the weekend due to the Easter holidays :)
Re-triggering challenge validation seems like a nice idea.
In my experience, once you've had one failure here, any subsequent DNS updates and challenges also fail. I normally have to stop the controller, delete any DNS records, and start up again from scratch to get it to work. If it could just retry the challenge a few times, that might well solve it.
Or if not, wait another 20/30 seconds. :)
I've read through the docs again and as far as I can tell, `DONE` should mean done (but doesn't).
I'll up the wait timer and investigate the Let's Encrypt API to see if the validation can be triggered multiple times.
From the ACME spec:
Clients SHOULD NOT respond to challenges until they believe that the server’s queries will succeed. If a server’s initial validation query fails, the server SHOULD retry the query after some time. While the server is still trying, the status of the challenge remains “pending”; it is only marked “invalid” once the server has given up.
I believe that Boulder (the Let's Encrypt server) currently doesn't retry DNS challenges at all; it immediately sets them to `invalid`.
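
If that is right, any retrying would have to happen on the client side, and each attempt would presumably need a fresh challenge, since one marked invalid cannot go back to pending. A rough sketch of such a loop is below. `validate_with_retries` and the `attempt_challenge` callback are hypothetical names for this example; the callback stands in for "request a new challenge, publish the DNS record, wait, respond, and poll until the status is final".

```python
import time
from typing import Callable


def validate_with_retries(
    attempt_challenge: Callable[[], str],  # hypothetical: runs one full attempt, returns final status
    max_attempts: int = 3,                 # assumed retry budget
    delay_seconds: float = 30.0,           # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run up to max_attempts full challenge attempts; True as soon as one is 'valid'."""
    for attempt in range(1, max_attempts + 1):
        status = attempt_challenge()
        if status == "valid":
            return True
        # Any other final status (e.g. "invalid") counts as a failed attempt.
        if attempt < max_attempts:
            # Give DNS more time to settle before starting over with a new challenge.
            sleep(delay_seconds)
    return False
```

Each retry counts against the server's rate limits, so the attempt budget should stay small.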