go-cloud pubsub: add batch timeouts

trafficstars

This PR adds a timeout option to batcher. The timeout specifies how long the batcher waits before sending batched messages, even if the message length is under MinBatchSize.

This ensures delivery guarantees within a given time period if there are low-throughput messages with a min batch size that may not be frequently met.

We'd like to set the minimum batch size, but had concerns that messages may not be sent if the minimum isn't met. One other consideration may be to flush batches on shutdown.

This PR is the barebones concept around batch timeouts, with a minimal untested plan. If this looks good to the gocloud team, we'll continue working on this and add tests for upstream.

What are your thoughts?

EDIT: This is now a PR on top of #3386.

To do:

[x] Implement batch timeouts in options
[x] Only allow one concurrent request to check for timeouts
[x] Handle timeouts when grabbing next batch
[x] Tests

Feb 23 '24 17:02 tonyhb

What problem are you seeing today that you're trying to solve? I.e., why do you want to set MinBatchSize to something greater than 1 if your throughput is low?

Feb 25 '24 02:02 vangent

@vangent our throughput is quite high — in the thousands to tens of thousands a second / tens to hundreds per batch, per container.

The problem lies with graceful shutdown and the distribution of work. On SIGINT, machines will stop processing new work. Outstanding work will continue. Most work finishes in milliseconds, but outliers take minutes to hours. As these outliers finish, the throughput drops dramatically, sometimes even to a single job. In this case, we still need pubsub to continue to send.

Without a timeout, a concern is that the min batch size may never be met in the case of shutdown, leading to messages that aren't flushed. Timeout will ensure that this finishes; we can extend the grace period of shutdowns by at least timeout. Timeout is also a guarantee on minimum QoS in the case of unpredictable issues.

Feb 27 '24 14:02 tonyhb

Interesting, thanks for the description. What do you hope to improve by increasing the MinBatchSize above 1?

Context: This proposal would add a fair amount of complexity to an already-quite-complicated system, so I'd prefer to avoid it if possible. Every time I touch this code I have to page in a bunch of context, run tests and benchmarks, and there are often surprising side effects for various edge cases.

Flushing on Shutdown seems like it might be a simpler approach...?

Feb 27 '24 18:02 vangent

I thought about flushing on shutdown and had a note in the comments/PR (or so I thought)! Totally understand wanting to keep things simple, too. That's my entire approach where possible :)

More context: with Google Pub/Sub we're sometimes seeing context timeouts directly in the gRPC layer. This happens even when passing fresh contexts that have no timeouts/deadline into publish. Google Pub/Sub's accept latency is ~3ms, and so we're suspecting that we're sending too many requests and not batching enough. This is corroborated with Pub/Sub's dashboards: we have thousands of RPS, and batches are smaller than we expect for our throughput (again, many thousands per second, though batch sizes are sometimes single digits).

I was hoping to use minimum batch sizes of, say, 10 or 20, to increase batch sizes and throughput at the cost of some latency while still being "safe". Ideally, timeouts would let us do this while also capping the max latency we can introduce, ensuring decent QoS.

Definitely in agreement that shutdown should flush batches. I'll follow up in a separate PR with that.

All of that considered, I totally understand the increased complexity. What are your thoughts on the QoS/latency guarantees and the overall context added here?

Feb 27 '24 21:02 tonyhb

@vangent here's a flush on shutdown PR with tests: https://github.com/google/go-cloud/pull/3386.

Edit:

I've rebased this PR on the flush-on-shutdown PR to clean up the code. In doing so, I've also cleaned up some of the logic and branching for simplicity & ease of understanding. No one wants huge if statements and complex branching. This now also contains tests.

My personal preference (FWIW, as a non-maintainer of course) is that they'd both land — minimum latency guarantees and flushes for safety. Of course, that's easy for me to say :)

Feb 27 '24 21:02 tonyhb

I think the flushing PR looks close to good, I am holding off on reviewing this one until the open questions there are addressed and it is submitted.

Jun 03 '24 03:06 vangent

go-cloud go-cloud copied to clipboard

pubsub: add batch timeouts

go-cloud
go-cloud copied to clipboard