Design ocsp-updater work smoothing
Pre-requisite for #5544. Deliverable: a doc outlining the problem, one or more possible approaches, and pros and cons of each approach.
Problem
Various conditions can cause ocsp-updater to have spikes of work. For instance, if there is an outage or other delay in running ocsp-updater, then when we come back online there will be a higher than usual number of responses that need re-signing. We'll re-sign all of those in a short amount of time, and they'll all come due again at the same time. We would prefer that the "due date" of those responses automatically gets smoothed out over time, so we don't have spikes of signing demand.
Solution 1: Assign to target update buckets
Each certificate would have an assigned time to be updated, based on some evenly distributed input like a hash of its serial number. For instance, if our target is to re-sign all OCSP responses every 70 hours, we would take hash(serial) % 70. If the remainder is 0, the certificate is assigned to hour 0. If the remainder is 1, the certificate is assigned to hour 1, and so on.
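A minimal sketch of the bucket assignment, for illustration only. The choice of SHA-256, the function names `updateBucket` and `currentBucket`, and the idea of counting hours since the Unix epoch to find the bucket that is currently due are all assumptions, not something ocsp-updater does today.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// updateBucket maps a certificate serial to an hour bucket in [0, buckets).
// SHA-256 is an assumption here; any evenly distributed hash would work.
func updateBucket(serial string, buckets uint64) uint64 {
	sum := sha256.Sum256([]byte(serial))
	// Read the first 8 bytes of the digest as an unsigned integer, then reduce.
	return binary.BigEndian.Uint64(sum[:8]) % buckets
}

// currentBucket returns the bucket that is due at the given Unix time,
// counting whole hours since the epoch (a hypothetical scheduling choice).
func currentBucket(unixSeconds int64, buckets uint64) uint64 {
	return uint64(unixSeconds/3600) % buckets
}

func main() {
	const buckets = 70 // target: re-sign everything every 70 hours
	serial := "03e7a1b2c3d4e5f60718293a4b5c6d7e8f90" // made-up serial
	fmt.Printf("serial %s is assigned to hour %d\n", serial, updateBucket(serial, buckets))
	fmt.Printf("bucket due now: %d\n", currentBucket(1700000000, buckets))
}
```

Because the bucket depends only on the serial, the assignment is stable across restarts and needs no extra stored state.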
We would have one task within ocsp-updater that signs an OCSP response for every certificate in the current update bucket, if the signing rate is otherwise < X (to be determined). We would also keep the current task that signs an OCSP response for every certificate where the last update is > Y hours ago. This task would not be conditioned on signing rate.
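The two tasks could be combined into a single selection rule, roughly as sketched here. The `cert` struct and `selectForSigning` are hypothetical, the signing rate is treated as a snapshot taken before the pass, and the sketch assumes the pass runs once per bucket hour (it does not guard against re-signing the same bucket twice within the hour).

```go
package main

import "fmt"

// cert is a hypothetical, simplified view of a certificate's OCSP state.
type cert struct {
	Serial   string
	Bucket   int     // assigned update bucket, e.g. hash(serial) % 70
	AgeHours float64 // hours since the OCSP response was last signed
}

// selectForSigning returns the certificates to sign in this pass.
// Stale responses (older than maxAgeHours, "Y") are always selected.
// Responses in the current bucket are selected only while the signing
// rate is below rateLimit ("X").
func selectForSigning(certs []cert, currentBucket int, signingRate, rateLimit, maxAgeHours float64) []cert {
	var due []cert
	for _, c := range certs {
		stale := c.AgeHours > maxAgeHours
		inBucket := c.Bucket == currentBucket && signingRate < rateLimit
		if stale || inBucket {
			due = append(due, c)
		}
	}
	return due
}

func main() {
	certs := []cert{
		{Serial: "aa", Bucket: 3, AgeHours: 10},
		{Serial: "bb", Bucket: 4, AgeHours: 80},
		{Serial: "cc", Bucket: 4, AgeHours: 10},
	}
	// Current bucket is 3, signing rate 50 against a limit of 100, Y = 72 hours.
	for _, c := range selectForSigning(certs, 3, 50, 100, 72) {
		fmt.Println(c.Serial)
	}
}
```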
Solution 2: Eager signing
If the signing rate is < X, gradually decrease Y (the number of hours a response must be stale before we re-sign it), proportional to how far below X the signing rate is. This causes us to sign responses a bit earlier than we otherwise would. As signing rate increases toward X, we gradually increase Y back to its configured value.
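One way to picture the adjustment is the sketch below. It assumes a linear relationship between unused signing capacity and the reduction in Y, a lower bound `floorHours` on Y, and a per-iteration step limit; none of these parameters are specified in the proposal, and the function names are hypothetical.

```go
package main

import "fmt"

// targetStaleness returns the staleness threshold Y (in hours) suggested by
// the current signing rate. At or above rateLimit (X) it is the configured
// value; below X it shrinks linearly toward floorHours, a hypothetical lower
// bound that the proposal does not specify.
func targetStaleness(configuredHours, floorHours, signingRate, rateLimit float64) float64 {
	if rateLimit <= 0 || signingRate >= rateLimit {
		return configuredHours
	}
	if signingRate < 0 {
		signingRate = 0
	}
	headroom := (rateLimit - signingRate) / rateLimit // fraction of X unused, in (0, 1]
	return configuredHours - (configuredHours-floorHours)*headroom
}

// stepToward moves current toward target by at most maxStep hours, so that Y
// changes gradually in both directions.
func stepToward(current, target, maxStep float64) float64 {
	switch {
	case target > current+maxStep:
		return current + maxStep
	case target < current-maxStep:
		return current - maxStep
	default:
		return target
	}
}

func main() {
	y := 72.0 // configured Y in hours
	for _, rate := range []float64{100, 50, 50, 0, 100} {
		y = stepToward(y, targetStaleness(72, 48, rate, 100), 4)
		fmt.Printf("rate=%.0f Y=%.1fh\n", rate, y)
	}
}
```

The step limit is the knob that the tuning concern in the pros and cons applies to: too small and little smoothing happens, too large and the eagerness itself causes swings.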
Pros / Cons
Solution 1 will tend towards even distribution pretty quickly. I don't see a downside.
Solution 2 will be a bit fiddly to tune to avoid big swings in signing rate. We don't know how many additional responses we will find to sign when we reach a certain number of hours into the future, so we have to adjust the eagerness gradually in both directions. But if we're too conservative in the rate of change, we won't see much smoothing. If we're too aggressive, we'll see additional load spikes.
I propose solution 1.
Just to toss one more out there (Solution 3, below), since we want a histogram anyway for compliance monitoring. It is certainly more complex than Solution 1, but it doesn't rely on the smoothness of the hash, and thus is less likely to cause immediate rework for freshly issued certificates whose hashes happen to fall into the next hour's bucket. But really, it's just about using the histogram and perhaps being somewhat easier to reason about. In reality, Solution 1 plus keeping a histogram for stats is probably still the right answer, but I don't want to leave this unsaid!
Solution 3: Histogram smoothing
Assumption: We already have the overhead of tracking the histogram of OCSP Responses by Minutes since Last Update, as that is a useful histogram for compliance metrics.
At the beginning of each iteration through the corpus of active certificates, determine:
- The set of responses which require immediate re-signing to meet their goal, BucketGoal.
- The histogram bucket(s) containing the greatest number of responses, excluding buckets "close to" BucketGoal (maybe just requiring that they not overlap it?), named BucketPeak.
- How far BucketPeak deviates from the mean bucket size, in number of responses.
- Emit the histogram for statistics.
- Clear the histogram.
Let QueueGoal be a queue containing responses to update from BucketGoal. Let QueuePeak be a queue containing responses to update from BucketPeak. Let PeakIntendedWorkaheadCount be the number of responses in BucketPeak minus the mean number of responses in each bucket across the histogram. Let PeakCount be a counter of how many responses have been added to QueuePeak, which will be less than or equal to PeakIntendedWorkaheadCount.
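As a rough illustration of choosing BucketPeak and computing PeakIntendedWorkaheadCount, consider the sketch below. It assumes the histogram is a slice of per-bucket counts, that BucketGoal is a single bucket index, and that "close to" means within a fixed number of buckets (`exclusion`); all three are simplifying assumptions, and `pickPeak` is a hypothetical name.

```go
package main

import "fmt"

// pickPeak returns the index of the fullest histogram bucket that is more
// than exclusion buckets away from goal, plus the work-ahead count: that
// bucket's size minus the mean bucket size, never below zero.
// It returns peak == -1 when every bucket is excluded.
func pickPeak(hist []int, goal, exclusion int) (peak, workahead int) {
	peak = -1
	total := 0
	for i, n := range hist {
		total += n
		if i >= goal-exclusion && i <= goal+exclusion {
			continue // too close to BucketGoal
		}
		if peak == -1 || n > hist[peak] {
			peak = i
		}
	}
	if peak == -1 {
		return -1, 0
	}
	mean := total / len(hist) // integer mean across all buckets
	if hist[peak] > mean {
		workahead = hist[peak] - mean
	}
	return peak, workahead
}

func main() {
	// Made-up counts per bucket; the last bucket is the one that is due now.
	hist := []int{2, 30, 4, 4, 10, 10}
	peak, workahead := pickPeak(hist, 5, 1)
	fmt.Printf("BucketPeak=%d PeakIntendedWorkaheadCount=%d\n", peak, workahead)
}
```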
For each active certificate:
- Populate the histogram
- If the updated time for the response for this certificate lies within BucketGoal, add it to QueueGoal
- Else, if the updated time for the response for this certificate lies within BucketPeak and PeakCount is less than PeakIntendedWorkaheadCount:
  - Add the response to QueuePeak.
  - Increment PeakCount.
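The per-certificate pass above might look something like this sketch. It assumes fixed-width buckets measured in minutes since last update and treats BucketGoal and BucketPeak as single bucket indexes; the `response` type and `scan` function are hypothetical, and the length of the peak queue stands in for PeakCount.

```go
package main

import "fmt"

// response is a hypothetical, simplified view of a stored OCSP response.
type response struct {
	Serial     string
	AgeMinutes int // minutes since last update
}

// scan makes one pass over the responses, rebuilding the histogram and
// filling the two queues. bucketMinutes must be positive. At most workahead
// responses from the peak bucket are queued.
func scan(responses []response, bucketMinutes, goalBucket, peakBucket, workahead int) (hist map[int]int, queueGoal, queuePeak []response) {
	hist = make(map[int]int)
	for _, r := range responses {
		b := r.AgeMinutes / bucketMinutes
		hist[b]++
		switch {
		case b == goalBucket:
			queueGoal = append(queueGoal, r)
		case b == peakBucket && len(queuePeak) < workahead:
			// len(queuePeak) plays the role of PeakCount.
			queuePeak = append(queuePeak, r)
		}
	}
	return hist, queueGoal, queuePeak
}

func main() {
	responses := []response{
		{"a", 10}, {"b", 70}, {"c", 75}, {"d", 80}, {"e", 130},
	}
	// 60-minute buckets, goal bucket 2, peak bucket 1, work ahead on 2 responses.
	hist, goal, peak := scan(responses, 60, 2, 1, 2)
	fmt.Println(hist, goal, peak)
}
```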
Simultaneously, while there are still responses in QueueGoal to work on:
- Let R be the next response in QueueGoal, if immediately available.
- If R is not set, let R be the next response that arrives from either QueueGoal or QueuePeak.
- Process R and write the updated response to the data stores.
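The consumer loop above maps fairly naturally onto Go channels. The sketch below assumes both queues are channels of serials and that the producer closes QueueGoal when the scan finishes; `nextResponse` and `drain` are hypothetical names, and the `process` callback stands in for signing and writing to the data stores.

```go
package main

import "fmt"

// nextResponse prefers QueueGoal: it takes a goal item if one is immediately
// available, and otherwise waits for whichever queue delivers first.
// It returns false once goal is closed and drained.
func nextResponse(goal, peak <-chan string) (string, bool) {
	// Non-blocking attempt on the goal queue first.
	select {
	case r, ok := <-goal:
		return r, ok
	default:
	}
	// Nothing ready on goal: block on both queues.
	select {
	case r, ok := <-goal:
		return r, ok
	case r, ok := <-peak:
		if ok {
			return r, true
		}
		// peak is closed and empty, so only goal can still deliver.
		r, ok = <-goal
		return r, ok
	}
}

// drain processes responses until QueueGoal is closed and empty.
func drain(goal, peak <-chan string, process func(string)) {
	for {
		r, ok := nextResponse(goal, peak)
		if !ok {
			return
		}
		process(r)
	}
}

func main() {
	goal := make(chan string, 2)
	peak := make(chan string, 2)
	goal <- "goal-1"
	peak <- "peak-1"
	goal <- "goal-2"
	close(goal)
	drain(goal, peak, func(r string) { fmt.Println("re-signed", r) })
}
```

In this sketch, peak items left over when QueueGoal is exhausted are simply dropped; they would be reconsidered on the next iteration through the corpus.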
Preemptively closing as work that will not be completed because we are instead removing ocsp-updater entirely: https://github.com/letsencrypt/boulder/issues/6285