[Task] [Epic]: DNS Plugin Rewrite to Multi-DNS Plugin
Problem:
DNS Plugins have become a maintenance burden on the Certbot team due to the long tail of requested DNS providers. Plugin architecture for Certbot needs to be maintained in a way that the community can still add in plugins while project maintainers focus on Certbot core.
Lexicon Option
A good portion of our plugins use the Lexicon framework:
- certbot-dns-dnsimple
- certbot-dns-dnsmadeeasy
- certbot-dns-gehirn
- certbot-dns-linode
- certbot-dns-luadns
- certbot-dns-nsone
- certbot-dns-ovh
- certbot-dns-sakuracloud
Next Steps for Rewrite
We decided not to use Lexicon for now. However, we still plan to rewrite using another available library. The following work will be needed to move this forward:
Since all plugins would use the same structure, mimicking @alexzorin's multi-dns plugin, with one credentials file to "rule them all" that funnels the needed credentials into a multi plugin for the chosen DNS provider, might be the best, least repetitive approach.
Migration would need to be a part 2 of this endeavor. For now the focus would be a multi-dns plugin that is working and tests well initially, and then migration plans for a later date.
- [ ] Create a multi-dns plugin that takes config values
- [ ] Test new multi-plugin with test credentials from a major provider
- [ ] Decide how to best test multi-dns plugin with providers we support
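As a rough illustration of the "one credentials file" idea above, here is a minimal sketch of a loader that reads a single ini file and hands one provider's options to the plugin. The file layout (one section per provider), the function names, and the shape of the returned dict are all assumptions made for this example, not an agreed format.

```python
import configparser


def load_multi_dns_credentials(ini_text):
    """Parse a hypothetical multi-DNS credentials file.

    Assumed layout (not an agreed format): one ini section per
    provider, holding that provider's options as key/value pairs.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {section: dict(parser[section]) for section in parser.sections()}


def build_provider_config(credentials, provider, domain):
    """Select one provider's options and wrap them in a single dict.

    The nested shape returned here is illustrative only.
    """
    if provider not in credentials:
        raise ValueError(f"no credentials configured for provider {provider!r}")
    return {
        "provider_name": provider,
        "domain": domain,
        provider: credentials[provider],
    }


# Example file contents; the option names are placeholders.
EXAMPLE_INI = """
[cloudflare]
auth_token = example-token

[ovh]
auth_entrypoint = ovh-eu
auth_application_key = example-key
"""

if __name__ == "__main__":
    creds = load_multi_dns_credentials(EXAMPLE_INI)
    print(build_provider_config(creds, "cloudflare", "example.com"))
```

Keeping the parsing separate from the provider selection would let the same file serve several providers without the plugin needing per-provider code.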
Things and steps to ponder for part 2:
- [ ] Switching users to the Lexicon multi-dns plugin safely and mitigating as much friction as possible (everyone would need to switch to using a credential ini file). Options for migration:
  - legacy config parser for each plugin which then passes values
  - One time rewrite for the new format
  - best-effort legacy config migration where it's feasible
  - Create an option for users to ask Certbot what we can handle
  - Worst case: error out and ask the user to look at lexicon's config docs
- [ ] Remove DNS plugins?
  - Can we ultimately remove them once the work is done?
  - What would sunset look like if so?
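The "best-effort legacy config migration" and "worst case: error out" options above could be combined roughly as follows. This is only a sketch: the mapping table, the target option names, and the exception class are hypothetical, and a real table would need to be checked against each plugin's documented credential keys.

```python
# Hypothetical mapping from legacy per-plugin credential keys to
# (provider, option) pairs in a new multi-DNS format. Entries are
# illustrative and would need checking against real plugin docs.
LEGACY_KEY_MAP = {
    "dns_dnsimple_token": ("dnsimple", "auth_token"),
    "dns_linode_key": ("linode", "auth_token"),
}


class UnsupportedLegacyConfig(Exception):
    """Raised when nothing in a legacy config can be migrated."""


def migrate_legacy_credentials(legacy):
    """Best-effort translation of legacy credential keys.

    Returns (migrated, unknown): migrated maps provider -> options,
    unknown lists legacy keys we could not translate so the caller
    can report them to the user.
    """
    migrated = {}
    unknown = []
    for key, value in legacy.items():
        target = LEGACY_KEY_MAP.get(key)
        if target is None:
            unknown.append(key)
            continue
        provider, option = target
        migrated.setdefault(provider, {})[option] = value
    if not migrated:
        # Worst case from the list above: give up and point the
        # user at the new format's documentation.
        raise UnsupportedLegacyConfig(
            "could not migrate any legacy keys: " + ", ".join(sorted(unknown))
        )
    return migrated, unknown
```

Returning the untranslated keys alongside the migrated ones would also support the "ask Certbot what we can handle" option, since the same function can back a dry-run report.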
there was more discussion about this in mattermost in the thread starting at https://opensource.eff.org/eff-open-source/pl/byuwjcbfd7bztqkoabatjp33sy
high level thoughts
i think we're all mostly on the same page here. i think the high level goals are:
- create a new DNS plugin around a library like lexicon that abstracts away the differences between DNS APIs allowing us to much more easily support more DNS providers
- migrate as much of our existing DNS plugin users to the new plugin as we can with minimal breakage
i think there are many unknowns here which will make a lot of high level planning here tricky. we can try and plan ahead and rewrite our plan as needed, but i think it may be easier to just focus on planning immediate next steps, executing them, and repeating. if y'all can forgive the reference, i'm basically thinking that tackling this in a more agile style than a waterfall one may have some value
DNS library choice
personally, i think the first step here should be deciding on the library we want to use. we could skip this and just choose lexicon as it's an obvious strong contender since we already have plugins based on it and have worked closely with its maintainers in the past, but i also think that a solid choice here could make our job significantly easier down the line. if we choose a library with many bugs or limited features/maintenance, i worry we may come to regret our decision and want to do this dance all over again
i'm not sure what else is out there nowadays. are there other nice python libraries? another thing that i think we could consider is using alex's approach from https://github.com/alexzorin/certbot-dns-multi and using the lego DNS providers. i'm a little hesitant to venture into using non-python libraries directly ourselves, but i think it may be worth it if we find a really nice library in another language
i think one thing to look at is the features offered by our current DNS plugins that are NOT based on lexicon and see if they can be offered by whatever library we choose here. for example cloudflare, google cloud, and amazon route53 all have pretty advanced features. which of these could we migrate to each 3rd party library without a major loss in functionality?
to use lexicon again as an example, they rely on boto3 for amazon route53. do they more or less expose all the features we did? what about for the other DNS providers where they don't rely on the official DNS libraries?
i think the ease of migrating our non-lexicon DNS plugins is just one aspect of our decision here. i think that if we have to continue supporting some of these more complex DNS providers directly, it's certainly not the end of the world. i just think pushing as much of this work upstream as possible is nice and i think this is an aspect of our library choice here that may not be immediately obvious
if y'all agree that library research is a good first step here, i think the first concrete goal here should be a research writeup of the different libraries considered and their pros and cons. alternatively, this could just be a deep dive into lexicon, trying to look ahead for any problems before we start implementing our new plugin with it as its main dependency
compatibility and stability
the only other thing i wanted to share right now is i personally think we should be fairly slow to release a stable version of this multi-DNS plugin until we've had a chance to play with it a fair bit and see what migrating other plugins looks like. i worry the work of migrating the other plugins may cause us to want to make changes to the multi-DNS plugin and it'd be nice to be able to do so without worrying about stability or maintaining compatibility for any existing users of the multi-DNS plugin
that's also not to say we should release a stable version and migrate everyone over all at once. that's a lot of stress on a fairly new piece of code. i'm just saying we should probably have an alpha/beta phase here and/or have looked deeply into things like how to migrate certificate renewal for existing plugin users to the new DNS plugin before we offer a stable version here
@wgreenberg, this post may be especially useful for you if you're on the same page as me as thinking carefully about lexicon and/or our other options as a first step here
Looks like documenting clarity on the following would help:
- Library stability (bug prone?, update frequency, etc)
- Assessing functionality loss
- trade offs with complex DNS providers
the only other thing i wanted to share right now is i personally think we should be fairly slow to release a stable version of this multi-DNS plugin until we've had a chance to play with it a fair bit and see what migrating other plugins looks like.
That's fair. Maybe we split this into structure, testing, and usage of the multi-DNS plugin first, and offset migration as its own project, because ultimately, migration with a refactor often calls for its own management.
thanks for the write-ups, @zoracon and @bmw! my current plan is to dig deeper into lexicon to answer the above questions, and if the answers seem less than optimal, i'll check out some alternative approaches.
i think one thing to look at is the features offered by our current DNS plugins that are NOT based on lexicon and see if they can be offered by whatever library we choose here. for example cloudflare, google cloud, and amazon route53 all have pretty advanced features. which of these could we migrate to each 3rd party library without a major loss in functionality?
@bmw which advanced features are you referring to here? in terms of these plugins' user-facing features, there's not much except for default_propagation_seconds and some plugin-specific stuff like specifying the Google Cloud project in question
after poking around a bit, i think the sources of these features would be things like:
- our plugin command line flags (which you found for things like the google project id)
- features in the underlying non-lexicon library which certbot benefits from
- hidden automation/logic inside our plugin code which could maybe be found by skimming our plugin code and/or looking thru git log <plugin_dir> if we wanted
there are examples of each of these below
cloudflare
there may not be much here! we support API keys in addition to tokens which lexicon does not, however, tokens are considered a better practice with cloudflare so that's probably fine
i think we're also saved by this text in our cloudflare docs which i hadn't remembered:
Please note that the cloudflare Python module used by the plugin has additional methods of providing credentials to the module, e.g. environment variables or the cloudflare.cfg configuration file. These methods are not supported by Certbot.
google cloud
from our documentation (which i'm seeing now that i messed up the link to last time), i'm seeing the project ID (which again you mentioned) as well as support for application default credentials (ADC) which it doesn't seem like lexicon has? we've even been recommending using ADC as the better approach in our documentation
our plugin also has this internal logic for ignoring private zones
amazon route53
the only thing i'm really seeing here are all the ways we support specifying credentials (environment variables, hidden files, etc.) which lexicon doesn't seem to support
we also have code to skip private zones like we do with google cloud, but it seems like this may be supported with lexicon's private_zone option for route53? in that case, maybe we could work with them to add a similar thing for google if we wanted
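To make the "skip private zones" behavior concrete, here is a minimal sketch of zone selection that ignores private zones and picks the most specific public zone for a record name. The `Zone` type and `find_public_zone` function are hypothetical and are not taken from our plugins or from lexicon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical zone record: a DNS name plus a private flag."""
    name: str
    private: bool


def _normalize(name):
    # Compare names case-insensitively and without a trailing dot.
    return name.rstrip(".").lower()


def find_public_zone(record_name, zones):
    """Return the longest-matching public zone for record_name, or None."""
    record = _normalize(record_name)
    candidates = []
    for zone in zones:
        if zone.private:
            continue  # private zones are not visible to the ACME server
        zone_name = _normalize(zone.name)
        if record == zone_name or record.endswith("." + zone_name):
            candidates.append(zone)
    if not candidates:
        return None
    return max(candidates, key=lambda z: len(_normalize(z.name)))
```

For example, with a public `example.com` zone and a private `sub.example.com` zone, a challenge record under `sub.example.com` would be placed in `example.com`.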
one more general thought
if we end up going the lexicon route and want ideas/inspiration on things like architecture, i think it may be worth looking at https://github.com/adferrand/dnsrobocert. this uses both certbot and lexicon, is written by a former certbot contractor and current lexicon maintainer, supports all lexicon providers, and even supports the CNAME logic i wrote my doc about a couple months ago. i think this may be less useful now if we're in a more abstract research phase, but i wanted to mention this before i forgot about it again
amazingly helpful writeup, thanks @bmw!
some additional notes:
- i think that lexicon's less flexible route53 and cloudflare credential support is probably fine, but agree that the lack of Google ADC support is a problem, so i opened lexicon issues for ADC support and private zone filtering. currently their Google plugin doesn't have external dependencies, so it's unclear if they'll be open to adding one for ADC support, but if they do i suspect it'll be relatively easy to implement
- we currently emit warnings when we detect that someone's credentials file has unsafe permissions, which we could easily do with their lexicon config. however, if the user has a config without secrets (either by passing them via env vars or one of Google ADC's other routes), this warning would be incorrect. not sure if there's a nice way to resolve this, but imo it's a minor concern
- checking out dnsrobocert was informative! for one, it shows just how simple our lexicon integration could look (https://github.com/adferrand/dnsrobocert/blob/0086a1b555cf6b8161dbc8815f1933b8a8677f1c/src/dnsrobocert/core/challenge.py#L53-L57 appears to be the entirety of robocert's interaction w/ lexicon). it also uses a sort of nested config setup, where there's a robocert yaml config which contains a yaml dict that gets passed into a lexicon ConfigResolver. if we needed to provide some meta-config for our plugin, i'd say we should go this route
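One possible way to handle the permissions warning mentioned above is to warn only when the config file is readable by group or others and the parsed config actually appears to contain secrets. The sketch below assumes a POSIX filesystem; the key-name heuristic and both function names are made up for illustration and would miss secrets stored under unusual option names.

```python
import os
import stat

# Heuristic substrings that suggest an option holds a secret.
# This list is an assumption, not something lexicon defines.
SECRET_HINTS = ("token", "secret", "password", "key")


def contains_secrets(config):
    """Return True if any non-empty option name looks like a secret."""
    for key, value in config.items():
        if isinstance(value, dict):
            if contains_secrets(value):
                return True
        elif value and any(hint in key.lower() for hint in SECRET_HINTS):
            return True
    return False


def should_warn_about_permissions(path, config):
    """Warn only for group/other-accessible files that hold secrets."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    exposed = mode & (stat.S_IRWXG | stat.S_IRWXO)
    return bool(exposed) and contains_secrets(config)
```

With this approach, a config that only names a provider and relies on env vars or ADC for its credentials would not trigger the warning, at the cost of depending on a naming heuristic.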
after looking into this, i'm in favor of making a lexicon multi-plugin. there's a couple open design questions i have about it now, though:
- to what extent do we want to support or migrate users' existing credentials? this could range from something as simple as printing a helpful warning telling people to visit the lexicon config docs, to injecting them into lexicon's ConfigResolver, to writing them to a lexicon yml file.
- relatedly, do we want to maintain existing CLI flags for the DNS plugins? i think this is probably a no due to the sheer number of them, but it might be nice since it'd mean not breaking people's existing configs
- should we add the "follow CNAME" logic before creating a record? i'd definitely vote yes on this, it seems extremely straightforward to add, and the net reduction in complexity we'd get from our lexicon rework more than allows for one or two new QOL features like this
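For the "follow CNAME" question, the core logic could be as small as the sketch below: keep following CNAMEs from the validation name until a name has none, then create the TXT record at that final name. The `lookup_cname` callable is a stand-in supplied by the caller (a real version would query DNS), and the function name and hop limit are assumptions for this example.

```python
def resolve_validation_name(name, lookup_cname, max_hops=10):
    """Follow a CNAME chain and return the final name.

    lookup_cname is a caller-supplied function taking a name and
    returning its CNAME target, or None if the name has no CNAME.
    """
    current = name.rstrip(".").lower()
    seen = {current}
    for _ in range(max_hops):
        target = lookup_cname(current)
        if target is None:
            return current  # no further CNAME: write the record here
        target = target.rstrip(".").lower()
        if target in seen:
            raise ValueError(f"CNAME loop detected at {target}")
        seen.add(target)
        current = target
    raise ValueError(f"too many CNAME hops starting from {name}")
```

Passing the lookup in as a function keeps the chain-following logic testable without network access; a dict's `get` method works as a fake resolver in tests.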
one thing i was a little bummed about w/ lexicon is that it doesn't seem you can implement custom providers, or overrides for existing ones, without just forking the repo. so we're either limited by upstream's decision-making (as in Google ADC support), or are forced to maintain our own fork
Given that we're trying to rip out all of our custom support, that might be fine. If people want more lexicon things and we have to push them upstream that sounds like the correct process result.
@wgreenberg TY for looking into the details further! With feedback from the last team discussion combined with your research, it seems a "stage 1" could be making sure we have a working and tested plugin. And "stage 2" would be handling migration to the multi-dns plugin.
We would be beholden to some things upstream, it seems, but I hope there's a response to the issues that you opened.
My question is: do the upstream pitfalls we foresee override the benefits of creating and migrating to this multi-dns plugin?
the only case where relying on lexicon is a major issue is Google's ADC functionality: lexicon's only auth mechanism for Google is a JSON file of credentials, which Google's docs recommend against using
i started working on implementing ADC for lexicon, but it'd basically be a full rewrite of that provider, so i stopped until the maintainer responds to the issue
poking around https://github.com/dns-lexicon/dns-lexicon, it looks like adrien may be the only maintainer of lexicon nowadays?
historically the lexicon repo was at https://github.com/AnalogJ/lexicon and analogj did a lot of work, however, it looks to me like his last commit to the project was over 5 years ago