lego icon indicating copy to clipboard operation
lego copied to clipboard

export resolver.obtainError?

Open dhduvall opened this issue 5 years ago • 14 comments

I got a badNonce during Obtain(). Based on the logs, it looks like it failed before it got confirmation of validation: I see the five badNonce errors after Checking DNS record propagation, but not the The server validated our request message; if I GET the auth URL, it tells me that the status is valid.

Obtain() returns an error that, according to %T, is of type resolver.obtainError. If there's a way to "cast" this to the underlying map[string]error, I can't figure it out, which means I can't get at any of the underlying errors and either report them, store them, act on any data they may contain, etc. In this case, I think I'd be able to assert that the error value in the map is actually an acme.NonceError and retry more (though I don't think I have access to either the URL I'd need or the routines to operate on it).

The only workaround I can think of is to parse the error string. At least that's fairly nicely structured, but it's still kinda gross.

Can resolver.obtainError be made public? That should be non-breaking, yes? Are there other errors that might be useful, as well?

I guess what I'm really looking for is a way for lego to tell me that although it failed, the operation is retryable (the failure isn't terminal), and to give me a way to retry it from whatever point it failed. This would take care of #771, too, but is probably a huge change.

dhduvall avatar Feb 09 '19 00:02 dhduvall

This also means that when I'm mocking Obtain(), I don't have any way of making my mock return one of the actual errors it'll get, so my testing won't be especially complete. This probably speaks to exporting the type, regardless of whatever other immediate uses it might or might not have.

dhduvall avatar Feb 09 '19 00:02 dhduvall

If you have an issue related to badNone this need to fixed inside lego.

I made the choice and spend time on v2 to not export obtainError (it was exported in v1), then I don't really want to change that.

ldez avatar Feb 09 '19 00:02 ldez

It's not clear to me if I run into the 5-badNonce limit, if that really means something's wrong, and there's no point in trying again, or whether the error is still temporary, but may need more tries than that.

In some ways, I don't really care; doubling the limit to 10 might make it work for more cases, but I suspect that there will always be some scenario where there's no suitable number, and lego really has no choice but to fail. That's fine, but my code has to be able to deal with that situation, and I have to be able to write tests for my code that can exercise its handling of it.

Other than doing a bunch of string processing, I don't know how to do either with lego as it is now.

dhduvall avatar Feb 09 '19 00:02 dhduvall

I need more context, because badNonce can have multiple sources. Doubling the limit is not a solution for me.

ldez avatar Feb 09 '19 00:02 ldez

What context are you looking for? I can give you the logs from that run. Is there anything else?

dhduvall avatar Feb 09 '19 00:02 dhduvall

Do you have many SAN? Do you make concurrency calls? etc.

I need to understand how the issue appears for you.

ldez avatar Feb 09 '19 00:02 ldez

This happened with two concurrent calls to Obtain(), each for an order containing two domains, one of which is a wildcard. That is, order 1 is for a.domain.com and *.a.domain.com, and order 2 is for b.domain.com and *.b.domain.com.

I hope to drive the concurrency up a lot (to anywhere between 50 and 500, probably), but have been testing with small batches so that the logs are easier to read.

I'm running this from a VM inside Google Cloud. It doesn't have a permanent external IP address, at least for IPv4; not sure whether it's ever connecting to LE via IPv6, and if so, what that looks like.

I do have the patch from #783 applied, and am using it to institute a two minute delay prior pre-check, and that seems to be working for me as I expect.

For the order that did work, the time from "preparing to solve" to "validated our request" was 2m18.5s for the wildcard, and 2m15s for the other, spanning a total of 2m19s. The "Wait for apply change" and "Wait for propagation" aren't tagged with their respective domains, but there aren't many of them; only two of the latter (suggesting that it didn't actually have to do any waiting), and six of the former.

There are actually six nonce errors in the logs; the first one is actually for a challenge URI for the order that ended up succeeding. About .15s later, there are five nonce errors in a row for an authorization URI from the failed order. They're spaced by .12s, .31, .19, and .24s, respectively (the "too many retry" error comes .2s later).

dhduvall avatar Feb 09 '19 01:02 dhduvall

That was what I thought.

I know there is a concurrency problem that produces badNonce, for now I haven't found the best way to handle this yet.

I wanted to correct that for v2, I de-prioritized because I had no issue on the subject.

ldez avatar Feb 09 '19 01:02 ldez

We have a service that uses LetsEncrypt to provide certificates for many different user apps. Some of our users misconfigure their DNS and so we are sometimes not able to provide them certificates.

We used a different Go ACME library (github.com/hlandau/acme/acmeapi) which does not support ACMEv2 (though the author may be fixing that right now?). Porting to Lego has overall been a great experience that let us delete the majority of our code.

However, our old implementation did parse the specific errors (which here you have as acme.ProblemDetails.Type) returned from LetsEncrypt for monitoring purposes. Some error types we wanted to get alerted on; others were just user misconfigurations and we wanted to display them in our dashboard to users but not alert us.

We can almost implement that with Lego, except that we can't break into the obtainError returned by Obtain to get the nested ProblemDetails.

glasser avatar Oct 18 '19 00:10 glasser

@glasser Just curious, what is the service you're using? (I ask because several of us in the industry are currently drafting up a document about best practices for ACME clients, especially those which operate at scale.)

mholt avatar Oct 18 '19 03:10 mholt

I want to log original errors received from let's encrypt but since obtainError not exported, there's no way to cast the err to map[string]error and obtain the error message. Is there a reason it's not exported? I've worked with other contrib libraries that had a need to log multiple error messages and they usually have exported an custom error type with an additional map field that contains these messages

csandanov avatar Mar 12 '22 13:03 csandanov

For me, the current obtainError type is still not exportable, this type was created a long time ago, the error system provided by Go has been improved. I think I can create a new type that can be better for lego API consumers.

ldez avatar Mar 12 '22 13:03 ldez

any updates on this issue? or maybe a recommendations for a workaround?

csandanov avatar Apr 18 '22 11:04 csandanov

any updates?

csandanov avatar Sep 01 '22 07:09 csandanov

sorry it takes more time than I was thinking, but now the underlying errors related to obtainError are usable. See #1993

ldez avatar Aug 18 '23 10:08 ldez