
ACMEv2: deal with perpetual "processing" orders

Open jsha opened this issue 7 years ago • 11 comments

It's possible for orders to enter processing status but not go to either valid or invalid status, due to RPC errors. This would cause problems with the pending order rate limit. We should add a cleanup task (possibly to an existing daemon) that cleans up orders that are still processing after some amount of time (say, five minutes or an hour).
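For illustration, a rough sketch of what such a cleanup task could look like. This is hypothetical: FailStaleOrders is an assumed helper, not an existing Boulder SA method, and the intervals are placeholders.

```go
// Hypothetical sketch only: FailStaleOrders is an assumed helper, not an
// existing Boulder SA method, and the intervals are placeholders.
package cleanup

import (
	"context"
	"log"
	"time"
)

type storageAuthority interface {
	// Marks orders still in "processing" since before cutoff as invalid,
	// returning how many rows were updated.
	FailStaleOrders(ctx context.Context, cutoff time.Time) (int64, error)
}

// runOrderCleanup periodically fails orders stuck in "processing" for longer
// than maxProcessingAge, so they stop counting against the pending-order limit.
func runOrderCleanup(ctx context.Context, sa storageAuthority, maxProcessingAge time.Duration) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			n, err := sa.FailStaleOrders(ctx, time.Now().Add(-maxProcessingAge))
			if err != nil {
				log.Printf("cleaning up stale processing orders: %v", err)
				continue
			}
			if n > 0 {
				log.Printf("marked %d stale processing orders as invalid", n)
			}
		}
	}
}
```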

jsha avatar Feb 06 '18 19:02 jsha

Just ran into this following today's outage (order: https://acme-v02.api.letsencrypt.org/acme/order/44933623/199303662).

tsuna avatar Nov 30 '18 23:11 tsuna

Right now the RA tries to handle errors during issueCertificate by calling out to ra.failOrder. However, in the most common case of failures during issueCertificate, the failure was caused by a timeout, which means the context deadline is almost up. I think rather than implementing a whole cleanup daemon, it makes sense to change the ra.failOrder to run in a goroutine and use a different context with a deadline that's still in the future.
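A minimal sketch of the pattern being described here, using simplified stand-ins rather than Boulder's actual RA/SA types and signatures:

```go
// Minimal sketch (simplified stand-ins, not Boulder's real types): run the
// order-failure RPC in a goroutine with a fresh context, so it isn't killed
// by the nearly-expired request deadline.
package ra

import (
	"context"
	"log"
	"time"
)

type ProblemDetails struct{ Detail string }

type StorageAuthority interface {
	SetOrderError(ctx context.Context, orderID int64, prob *ProblemDetails) error
}

type RegistrationAuthorityImpl struct {
	SA StorageAuthority
}

// failOrder marks an order as failed without blocking on, or inheriting the
// deadline of, the original request context.
func (ra *RegistrationAuthorityImpl) failOrder(_ context.Context, orderID int64, prob *ProblemDetails) {
	go func() {
		// The caller's context is almost out of deadline; give the SA call
		// its own short, fresh deadline instead.
		saCtx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		if err := ra.SA.SetOrderError(saCtx, orderID, prob); err != nil {
			log.Printf("failed to mark order %d as invalid: %v", orderID, err)
		}
	}()
}
```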

jsha avatar Dec 11 '19 23:12 jsha

(Adding helpwanted label because this seems like a good candidate for a goroutines starter bug.)

aarongable avatar Feb 17 '21 02:02 aarongable

I'd be glad to help on this one if it's still relevant.

Right now the RA tries to handle errors during issueCertificate by calling out to ra.failOrder. However, in the most common case of failures during issueCertificate, the failure was caused by a timeout, which means the context deadline is almost up. I think rather than implementing a whole cleanup daemon, it makes sense to change the ra.failOrder to run in a goroutine and use a different context with a deadline that's still in the future.

This seems reasonably doable, deriving a new context from the expired one and/or using another cancellation pattern. Shall I re-read the code in its current form and raise a PR? I may ask for clarifications along the way as I read the relevant parts of the code base.

inge4pres avatar Nov 28 '22 19:11 inge4pres

Sure, go for it! Happy to answer any questions you run into here. The ra.failOrder function is only called by FinalizeOrder, so it's probably easiest to make this change directly within failOrder.

aarongable avatar Nov 29 '22 00:11 aarongable

After some time looking into the code, I have a rough understanding of what the actual behavior is and what the desired change should be.

My understanding is that:

  • failOrder is responsible for informing the StorageAuthority about a cleanup to be made (via any SetOrderError implementation)
  • failOrder is passed the same context as FinalizeOrder, which expires when any of the RA methods it calls takes longer than expected (most commonly issueCertificate, which interacts with other RPCs)
  • since that context is expired, failOrder doesn't really do what it should

If these assumptions are correct, I don't think it's necessary to add an additional background process to complete the cleanup; we may create a new context to pass to failOrder, with a fixed deadline that we expect the SA operations that set the order to an error state will respect.
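For contrast with the goroutine sketch above, this is roughly what the purely synchronous form of that idea would look like, reusing the same stand-in types; failOrderSync is a hypothetical name, not something in the code base.

```go
// Synchronous variant: the SA call gets its own fresh deadline, but
// FinalizeOrder still waits for it before returning to the client.
func (ra *RegistrationAuthorityImpl) failOrderSync(orderID int64, prob *ProblemDetails) error {
	saCtx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	// No longer cancelled by the expired request context, but it still adds
	// its latency on top of an already slow request.
	return ra.SA.SetOrderError(saCtx, orderID, prob)
}
```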

Does this make any sense?

inge4pres avatar Dec 02 '22 16:12 inge4pres

That's almost all correct! The one correction is that we will actually need a background process (goroutine) to handle failOrder.

As you said, the reason that the context is getting cancelled is because the total request is taking longer than our configured timeout (say, 200ms). The whole point of having that timeout is that we want the API call to return to the user quickly, provide the result or error quickly, and not continue consuming our frontend resources for a long time.

If we create a new context with a new timeout, and pass that to failOrder (or equivalently have failOrder create a new context and pass it to sa.SetOrderError), then sure, that call won't get cancelled... but it will still be a synchronous call, and will end up making the total request take longer than 200ms.

So instead we should both create a new context with a slightly-further-out deadline, and spin the request off into an asynchronous goroutine that can set the order's error status without continuing to block the already-timed-out API request.

aarongable avatar Dec 02 '22 22:12 aarongable

So instead we should both create a new context with a slightly-further-out deadline, and spin the request off into an asynchronous goroutine that can set the order's error status without continuing to block the already-timed-out API request.

Thanks for the explanation 🙏🏼 I hadn't understood latency was a concern.

I raised a PR with a possible implementation: it's a per-RPC goroutine that will wait to receive either:

  • a message with a ProblemDetails coming from the order processing
  • a context deadline expiration

and will fail the order if necessary.

This additional per-RPC goroutine runs for every FinalizeOrder RPC request, not only when an order needs to be marked as failed. If there is no failure in any of the steps needed to fulfill the order, it simply does nothing: it exits when processing completes and the receiving channel is closed.

There is a second possible implementation that I thought of but didn't choose: a "global" goroutine, attached to RegistrationAuthorityImpl, that always runs in the background waiting for orders to be marked as failed. I didn't choose this second option because:

  • the RegistrationAuthorityImpl struct is already big, with many fields orchestrating various RPC clients and configurations
  • it's riskier, as a single goroutine could end up blocking the senders in the RPC, further delaying the response

Let me know your thoughts 😄

inge4pres avatar Dec 04 '22 17:12 inge4pres

Thanks for tackling this! I've posted comments directly on the PR. I think the general approach of using a single short-lived goroutine for each request (rather than having a long-lived background goroutine receiving problems from all requests) is definitely the right idea. But I think it's not necessary to run that goroutine for every FinalizeOrder request; rather I think we can get away with only running it for FinalizeOrder requests that fail.

It may even be possible to only spin off the goroutine for FinalizeOrder requests that fail due to context timeout. That would be great.

I've left comments to this effect directly on the PR. Let me know if you have any questions!

aarongable avatar Dec 05 '22 19:12 aarongable

It may even be possible to only spin off the goroutine for FinalizeOrder requests that fail due to context timeout. That would be great.

I believe this case was covered in the initial implementation with the always running goroutine for every RPC 🤔

Let me recap once more what the final design could look like:

  1. a single goroutine to optionally fail the order is spawned on every FinalizeOrder RPC; it runs asynchronously so it doesn't delay the RPC response
  2. this goroutine waits for either:
       a. the expiration of the FinalizeOrder context, signaling the order can't be processed due to downstream failures
       b. a message of type *ProblemDetails that signals an error during order processing
       c. the input channel being closed (which means the entire order processing has completed successfully)
  3. in the failure cases (2.a and 2.b), it must create a new context with a fixed duration (1 sec) to call into SA.SetOrderError and mark the order as failed (a rough sketch follows this list)
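A rough sketch of that per-request goroutine, reusing the stand-in types from the earlier sketch; the names here are illustrative, not the ones used in the actual PR.

```go
// Illustrative only: per-FinalizeOrder goroutine that fails the order when a
// problem is reported or the request context expires (steps 2 and 3 above).
func (ra *RegistrationAuthorityImpl) watchOrder(ctx context.Context, orderID int64, probs <-chan *ProblemDetails) {
	go func() {
		var prob *ProblemDetails
		select {
		case p, ok := <-probs:
			if !ok {
				// 2.c: channel closed, processing completed successfully.
				return
			}
			// 2.b: a processing step reported a problem.
			prob = p
		case <-ctx.Done():
			// 2.a: the FinalizeOrder context expired before processing finished.
			prob = &ProblemDetails{Detail: "order processing timed out"}
		}
		// 3: use a fresh, fixed deadline so the SA call isn't affected by the
		// already-expired request context.
		saCtx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		if err := ra.SA.SetOrderError(saCtx, orderID, prob); err != nil {
			log.Printf("failed to mark order %d as invalid: %v", orderID, err)
		}
	}()
}
```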

I have a question here: is it ok to call into SA.SetOrderError even when SA.SetOrderProcessing has not completed due to context expiration? For case 2.a I had asked whether it was ok to add a metric (https://github.com/letsencrypt/boulder/pull/6538#discussion_r1039991901); what type of error handling should we do there?

I've left comments to this effect directly on the PR. Let me know if you have any questions!

Thanks for the review and the comments, they were very helpful 🙏🏼 I've pushed a new commit incorporating your suggestions, but I might as well push another commit with what's described above, and then we can choose whether to keep it or not.

inge4pres avatar Dec 12 '22 11:12 inge4pres

https://github.com/letsencrypt/boulder/pull/6538/commits/c6ae098efce13ab2827b972481198121c5fd2906 implements what's described above. The difference from the first implementation is that the concurrency control is more explicit, living in a single closure inside processOrder rather than in failOrder.

Let me know if you'd like to keep this commit as well, or if it's better to stay with the previous one. Personally, I like this second version better, as it ensures that any timeout in the downstream requests made by FinalizeOrder is captured and marks the order as failed.

Given my very limited exposure to the codebase, this is my biggest doubt now:

I have a question here: is it ok to call into SA.SetOrderError even when SA.SetOrderProcessing has not completed due to context expiration?

inge4pres avatar Dec 13 '22 14:12 inge4pres