Prefer publishing full tiles
Add the ability to delay submissions for up to a few sequencing rounds, so that full data tiles can be published instead of partial tiles whenever possible.
Not sure if you want this, but we just recently implemented it in Azul (https://github.com/cloudflare/azul/issues/33).
Interesting, how frequently do you sequence? I felt like 1s was already pushing it since the whole issuance chain is held up on it.
The Azul prototype logs are sequencing every second too. I'm not really sure how much tolerance CAs have for sequencing delay, but if 1s is acceptable then maybe 2s in the worst case is OK too? And it's only the unlucky "leftover" entries that don't fit into full tiles that would end up getting delayed. Since those entries were the last to be added to the pool, they'll probably only be delayed by ~1.5s on average (a fraction of a second waiting out the round they arrived in, plus one extra 1s round).
There's probably a way to model this nicely, but intuitively if a log is sequencing 128 entries/second and you're willing to delay entries by a single sequencing round, then on average half the entries will be delayed by a sequencing round and the others will be sequenced right away. Under a heavier load, the benefits only go up, since more entries will fit into full tiles right away. Under a lighter load, you'll delay a larger share of the entries but still be producing fewer partial tiles.
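Here's a toy simulation of that intuition, holding leftovers back for at most one extra round (the 256-entry tile width is from tlog-tiles; the arrival model and everything else is made up for illustration):

```go
// Toy simulation of the hold-back policy: each 1s round, publish as many
// full 256-entry tiles as possible and carry the unaligned remainder into
// the next round, flushing it as a partial tile once it has waited one
// extra round.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const tileWidth = 256
	for _, rate := range []int{64, 128, 512} { // mean entries per round
		pool, delayed, total, partials := 0, 0, 0, 0
		waited := false // whether the current leftovers already waited a round
		for round := 0; round < 100000; round++ {
			arrivals := rand.Intn(2*rate + 1) // crude stand-in for bursty traffic
			pool += arrivals
			total += arrivals
			leftover := pool % tileWidth // the unaligned tail; full tiles ship now
			if leftover > 0 && waited {
				partials++ // leftovers hit the delay cap: write a partial tile
				leftover = 0
			}
			delayed += leftover // these entries wait for the next round instead
			waited = leftover > 0
			pool = leftover
		}
		fmt.Printf("mean rate %3d/s: %4.1f%% of entries delayed, %d partial tiles\n",
			rate, 100*float64(delayed)/float64(total), partials)
	}
}
```

Lighter traffic should show a larger delayed share and most of the partial tiles, while heavier traffic lands on tile boundaries more often on its own, which matches the intuition above.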
Speaking only for Let's Encrypt and our CT implementation:
We time out and try another log after 2 seconds. Ideally logs wouldn't come too close to that, so we're not timing out early and "wasting" an extra CT submission on another log. We could probably make that longer if we had to.
Navigli2025h2's 75th-percentile time for successful submissions hovers just under the 1s mark, which is about where I'd want it.
Could we wait longer? 5 seconds? A dynamic value tuned to a particular log's performance? Maybe, but that's just more engineering and validation.
I think if the timeout was something like:
- At 0.8s, begin waiting for the current tile to fill.
- At 1.2s, time out and write a partial tile.
That might avoid some partial tiles without too much impact. Assuming we're averaging about 128 entries/sec, we could target every other sequencing round ending on a tile boundary.
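For concreteness, a minimal sketch of that schedule, assuming tlog-tiles' 256-entry tiles; the pool type and its methods are hypothetical, not any log's actual API:

```go
// Two-threshold round: sleep 0.8s, then stretch the round until the
// pending entries would land on a tile boundary, giving up and accepting
// a partial tile at 1.2s.
package main

import "time"

const tileWidth = 256

type pool struct{} // stand-in: tree size, pending entries, storage handle, ...

func (p *pool) treeSize() uint64     { return 0 } // stub
func (p *pool) pendingCount() uint64 { return 0 } // stub
func (p *pool) sequence()            {}           // stub: write tiles, sign a checkpoint

func sequenceLoop(p *pool) {
	for {
		start := time.Now()
		time.Sleep(800 * time.Millisecond) // minimum round length
		deadline := start.Add(1200 * time.Millisecond)
		// Between 0.8s and 1.2s, hold off while the checkpoint would land
		// mid-tile, hoping a few more submissions arrive to fill the tile.
		for time.Now().Before(deadline) &&
			(p.treeSize()+p.pendingCount())%tileWidth != 0 {
			time.Sleep(10 * time.Millisecond)
		}
		// Before the deadline this only ever writes full tiles; past it,
		// the unaligned tail goes out as a partial tile.
		p.sequence()
	}
}

func main() { sequenceLoop(&pool{}) }
```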
For logs maxing out throughput, perhaps the rate limit could be expressed in tiles instead of entries. So instead of 750 entries/sec, it's 3 tiles: fill the previous partial tile, then add up to two more full tiles. Logs aren't sequencing at their limit very often, though, so I don't know how helpful that is.
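If it helps, the translation is just arithmetic on the tree size; a sketch, with illustrative names and numbers rather than an existing knob:

```go
// Tile-denominated rate limit: per round, admit enough entries to complete
// the current partial tile plus up to two more full 256-entry tiles.
func entryBudget(treeSize uint64) uint64 {
	const tileWidth, fullTiles = 256, 2
	fillPartial := (tileWidth - treeSize%tileWidth) % tileWidth
	return fillPartial + fullTiles*tileWidth
}
```

With 256-entry tiles that works out to between 512 and 767 entries per round, the same ballpark as 750/sec.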
Thanks for the insight! A per-entry max sequencing delay of 1.2s is a useful target. It's also relevant for the batching that the Azul implementation does: right now batches time out after 1s, so I'll look into reducing that. The prototype logs are hovering in the 2-3s range for add-[pre-]chain requests, which doesn't meet Let's Encrypt's current 2s threshold.