Error Handling When Plugin Options Change
When the options for a plugin change (most typically when a new option is added), Lemur is left in a broken state, at least when dealing with certificate authorities.
I would expect Lemur to check that all expected options exist and, if any are missing, create them with their default values, thus handling these upgrades gracefully. Alternatively, I would expect there to be a Lemur command that could be run to validate all plugin configurations and offer a way to correct any that are missing options.
Instead, Lemur errors out without displaying any error in the web UI, leaving functionality broken. Most recently, this happened when upgrading to 0.8.0 because of an option added to the LetsEncrypt plugin. When loading the Authorities page, no data would load and the logs would be full of Python errors. The only way to fix it was to manually go into the PostgreSQL backend and add the new setting to each configured authority (even the ones that were not enabled). Only then would the page load properly. This also caused the reissue job to fail in the same manner.
Steps to reproduce:
- Set up Lemur 0.7.0
- Configure a Let's Encrypt authority
- Upgrade to Lemur 0.8.0
- Attempt to load the Authorities page
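The expected behavior described above (fill in any missing option with its default) could look roughly like the following. This is only a sketch, not Lemur's actual code: it assumes that both the stored authority options and the options declared by the plugin are lists of dicts keyed by "name", that stored options carry a "value", and that declared options may carry a "default". The function name `fill_missing_options` is hypothetical.

```python
import copy


def fill_missing_options(stored_options, declared_options):
    """Return a copy of stored_options, extended with any declared option
    that is missing, using the declared default as its value.

    Assumption: options are lists of dicts with a "name" key; declared
    options may define "default". This mirrors, but is not copied from,
    how Lemur plugins describe their options.
    """
    known_names = {opt["name"] for opt in stored_options}
    merged = copy.deepcopy(stored_options)
    for declared in declared_options:
        if declared["name"] not in known_names:
            new_option = copy.deepcopy(declared)
            # Fall back to None when the plugin declares no default.
            new_option["value"] = declared.get("default")
            merged.append(new_option)
    return merged


# Options saved under the old version (no store_account yet).
stored = [{"name": "acme_url", "type": "str", "value": "https://example.invalid"}]
# Options declared by the upgraded plugin (illustrative values only).
declared = [
    {"name": "acme_url", "type": "str", "default": ""},
    {"name": "store_account", "type": "bool", "default": False},
]
upgraded = fill_missing_options(stored, declared)
```

Existing values are left untouched, so running this on every load or once during an upgrade would give the same result.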
Thanks @ardichoke for sharing your experience. You are making a valid argument. Migrating to the latest Lemur versions should be smooth, and not break existing functionality.
Tagging @peschmae, who has made significant contributions to the acme plugin and might be able to help us smooth things out. I will also make a point of looking into this, time permitting.
Any chance you know which config property needed to be set manually in the database?
It was the store_account option from 898b5da6613294403da6683f20c45abe3f4bd7f3, though I'd wager that the addition of any new database backed option to a configured plugin would cause this issue to happen.
We should be able to do the check here, and potentially add store_account to the plugin options if it is missing. I wonder if we should be opinionated and set store_account to true if it is missing, or just set it to false and continue the previous behavior of "Creating an account for each certificate issuance". The latter would be less intrusive to the experience, but would miss an opportunity to optimize.
https://github.com/Netflix/lemur/blob/ad5f7aef822085c7889534f7c1d8a17b5e213e35/lemur/plugins/lemur_acme/acme_handlers.py#L185
By default, the store_account option is set to false, since I wanted to keep the current behavior, even though in my opinion it would make more sense to reuse the account.
The rate limit for creating new accounts is far lower than the rate limit for certificates (10 accounts / 3 hours vs 300 orders / 3 hours), which would lead to an issue if you tried to renew more than 10 certificates at once.
So I vote to set the default value to true, fix the check in acme_handlers to default to true as well, and just set the value if it isn't set yet.
But I'm not sure that would solve the reported issue, namely that the Authorities page is broken after the update. I'll need to reproduce this and look at the Python logs to know for sure what's happening there.
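For illustration, a lookup that tolerates a missing store_account could be sketched as below. This is not the code in acme_handlers: the helper name `get_option_value` is hypothetical, the options are assumed to be a list of dicts with "name" and "value" keys, and the fallback is passed in explicitly because the thread has not settled whether it should be true or false.

```python
def get_option_value(options, name, default):
    """Return the value of the named option, or default when the option
    is absent or has no value set.

    Assumption: options is a list of dicts with "name" and "value" keys.
    """
    for opt in options:
        if opt.get("name") == name and opt.get("value") is not None:
            return opt["value"]
    return default


# Authority created before store_account existed.
old_authority_options = [{"name": "acme_url", "value": "https://example.invalid"}]
# Authority where the option was set explicitly.
new_authority_options = [{"name": "store_account", "value": False}]

reuse_account_old = get_option_value(old_authority_options, "store_account", True)
reuse_account_new = get_option_value(new_authority_options, "store_account", True)
```

An explicitly stored false is respected; only a missing or unset option falls back to the default.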
Here's a sample of the error logs from when I was troubleshooting this last week, before we figured out how to fix the issue. I don't know how helpful it will be: in the process of adding the missing option, we messed up the JSON format in the database, which took us a while to figure out how to fix properly. I'm not sure whether this error was due to the missing option or the bad format.
That looks more like the exception you get if the authority options aren't valid JSON. But I'll try to get my local dev environment up and running again on the weekend or sometime next week, so I can spin up Lemur 0.7 and then upgrade to 0.8 to reproduce the issue.
Yeah, that's what I figured was the case. Unfortunately, I think the logs from before the JSON got hosed were lost over the course of working on this.
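To tell the two failure modes apart in the future (malformed JSON versus a missing option), a check along these lines might help. It is a rough sketch under assumptions: the authority options are taken to be stored as a JSON string holding a list of dicts with a "name" key, and `diagnose_options` is a hypothetical helper, not an existing Lemur command.

```python
import json


def diagnose_options(raw_options, required_names):
    """Classify a stored options string as invalid JSON, structurally
    wrong, missing required options, or ok.

    Assumption: raw_options is the JSON text stored for one authority.
    """
    try:
        options = json.loads(raw_options)
    except (TypeError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError; TypeError covers None input.
        return f"invalid JSON: {exc}"
    if not isinstance(options, list):
        return "invalid structure: expected a list of options"
    present = {opt.get("name") for opt in options if isinstance(opt, dict)}
    missing = sorted(set(required_names) - present)
    if missing:
        return "missing options: " + ", ".join(missing)
    return "ok"


good = diagnose_options('[{"name": "store_account", "value": false}]', ["store_account"])
incomplete = diagnose_options('[{"name": "acme_url", "value": ""}]', ["store_account"])
broken = diagnose_options('[{"name": "acme_url", ', ["store_account"])
```

Running such a check over every authority before and after an upgrade would show whether a page failure comes from the missing option or from a hand-edited row.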