Improve timeout configuration, esp for async operations

Open aarongable opened this issue 5 months ago • 2 comments

Today, we set gRPC timeouts on a per-service basis. So, for example, every request from the RA to the SA has the same timeout set, no matter whether that request is a read, a write, or a big batch operation; no matter whether that request is the only one in the whole code path or the first of many sequential requests; no matter whether an HTTP client is waiting on the result or we're doing async work.

For our most extreme example, take CAA checking. The RA calls VA.CheckCAA at two different times:

  • during validation, which is async, and during which CAA is the last major operation (after DoDCV); and
  • during finalization, which is synchronous, and during which CAA rechecking is one of the first major operations (before all issuance and getting SCTs).

Having the same timeouts across all methods on a service, and even across the same method called in different contexts, is not serving us well. We should have more granular control here.

Proposal 1: Keep everything in gRPC. The cmd.GRPCClientConfig would grow a new stanza which maps method names to specific timeouts; any unmapped methods would inherit the top-level timeout. This provides a uniform solution which can be applied across all of our gRPC clients as needed, but doesn't handle the CheckCAA case. For cases like that, we could create a new RPC (e.g. va.RecheckCAA) which calls the same underlying server code but could have a separate timeout configured on the client side.

Proposal 2: Bespoke timeouts. Set all of our gRPC timeouts to some default value (say, 90s) which will serve only as a backstop. Configure the WFE with per-API-method timeouts, so that our normal "shave a few ms off at each gRPC layer" system can provide tighter deadlines for each component. Where we need custom tighter timeouts, e.g. for CAA rechecking, add new bespoke config items and have the Boulder code calling those methods manually call context.WithTimeout. This gives us fine-grained control over timeouts at all levels of Boulder, but doesn't form a holistic "system" for controlling timeouts.

In general, I lean towards Proposal 2. This is for two reasons: first, I think that adding gRPC methods like "va.RecheckCAA" solely for the sake of timeouts is unfortunate and unergonomic, and so Proposal 1 will end up having some custom timeouts like Proposal 2 anyway; and second, I like the idea of all of the WFE's API methods having timeouts because it gives us a one-stop shop to see how long these methods should actually be taking.

aarongable avatar Jun 02 '25 22:06 aarongable

"Configure the WFE with per-API-method timeouts"

Note that this can (and should) happen under both Proposal 1 and Proposal 2.

To narrowly evaluate how we should solve the "CheckCAA doesn't leave enough time for getting SCTs" problem, I think it makes sense to treat WFE per-API timeouts as their own issue, not as part of proposal 2.

Under proposal 1 we could manage CheckCAA by simply decreasing the timeout for both initial CAA checks and final ones to 20 seconds, since 30 is really quite long for those calls. Actually, thinking about it, there's value in using the same timeout for both rechecks and initial CAA checks: if your DNS server is slow and always returns results at 29s, it would be inconsistent to have the CAA check succeed at initial validation but fail upon rechecking due to the shorter timeout.

jsha avatar Jun 02 '25 23:06 jsha

So your proposal is: Set per-API-method timeouts in the WFE, and set per-gRPC-method timeouts in our gRPC client configs, and use those timeouts to reduce the CheckCAA timeout? Works for me.

aarongable avatar Jun 02 '25 23:06 aarongable