CT: Add support for fallback between an operator's logs

Open • aarongable opened this issue 11 months ago • 1 comment

Today we only attempt to submit to one log from each operator. This limits our pool of potential logs, which can be a problem if some of those logs are experiencing uptime difficulties. It would be nice to be able to fail over gracefully between logs run by the same operator, as long as we never end up including multiple SCTs from the same operator in the final certificate.

Apr 21 '25 21:04 aarongable

This is partly complete -- as of https://github.com/letsencrypt/boulder/pull/8152 we now submit to logs individually, instead of selecting just one log from each operator. This means we can submit to another log by the same operator if an earlier log by that operator failed.

But we'd like to do more:

[ ] Instead of racing logs against each other, simply kick off two goroutines, each of which is responsible for returning one SCT.
[ ] Instead of pre-shuffling all logs and working down the list, intelligently pick the next log to submit to based on what SCTs (and potentially what in-flight requests) we already have.
[ ] Bypass ctgo's built-in backoff-and-retry, since it makes some of our submission latencies artificially high.
[x] Support "test" logs (which don't have a status like Usable or Readonly) so we can use them in Staging
[x] Don't worry about log type or status except for SCT submissions. Let informational and final submissions go to any log (with an appropriate temporal interval).

May 05 '25 20:05 aarongable