Allow batch reading in watermill-sql
The current implementation of watermill-sql reads exactly one message at a time (PostgreSQL implementation: https://github.com/ThreeDotsLabs/watermill-sql/blob/master/pkg/sql/schema_adapter_postgresql.go#L66).
I would expect a significant subscriber performance improvement if the subscriber could read messages in batches of configurable size, e.g. 10 or 100 rows per SELECT, as sketched below.
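A minimal sketch of what such a batched read could look like; the `watermill_messages` table, its columns, and the query shape are assumptions for illustration, not the actual watermill-sql schema:

```go
package main

import (
	"context"
	"database/sql"
)

// selectBatchSQL fetches up to batchSize rows in a single round trip,
// instead of issuing one SELECT per message.
const selectBatchSQL = `
	SELECT "offset", uuid, payload, metadata
	FROM watermill_messages
	WHERE "offset" > $1
	ORDER BY "offset"
	LIMIT $2`

func selectBatch(ctx context.Context, tx *sql.Tx, lastOffset int64, batchSize int) (*sql.Rows, error) {
	return tx.QueryContext(ctx, selectBatchSQL, lastOffset, batchSize)
}
```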
My question is whether there are any conceptual arguments against this idea.
Following @breml, I have a similar question about the MySQL implementation: https://github.com/ThreeDotsLabs/watermill-sql/blob/a2768559a9c416c1d8b5fe506401dc51639abb63/pkg/sql/subscriber.go#L249 Is it possible to optimise the reads?
Today I did some initial tests with batch reading, and it is not that easy (and I am no longer sure it will bring a performance gain).
The problem is: there are known problems with Go, database/sql and the postgres driver when trying to execute additional database queries while still reading a result set (sql.Rows). This confuses the postgres driver, and it returns errors instead of executing the queries.
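A minimal reproduction of the failing pattern, assuming a `*sql.Tx` on a lib/pq connection (table and column names are made up for the example):

```go
package main

import (
	"context"
	"database/sql"
)

// consumeInterleaved shows the pattern that does not work: issuing another
// statement on the transaction's connection while rows is still open.
func consumeInterleaved(ctx context.Context, tx *sql.Tx) error {
	rows, err := tx.QueryContext(ctx,
		`SELECT "offset" FROM watermill_messages ORDER BY "offset" LIMIT 100`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var offset int64
		if err := rows.Scan(&offset); err != nil {
			return err
		}
		// This UPDATE runs on the same connection while the result set is
		// still being read; the postgres driver returns an error here
		// instead of executing the statement.
		if _, err := tx.ExecContext(ctx,
			`UPDATE watermill_offsets SET offset_consumed = $1`, offset); err != nil {
			return err
		}
	}
	return rows.Err()
}
```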
The workaround to the above problem would be to first read all rows (e.g. 100) into memory, close the sql.Rows, and only then update the offset_consumed field in the DB, as in the sketch below. I did not implement or test this, because I am no longer convinced that it will bring the performance gain I was hoping for.
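A sketch of that workaround under the same assumed schema: drain the result set into memory, close the sql.Rows, and only then run the offset update:

```go
package main

import (
	"context"
	"database/sql"
)

type storedMessage struct {
	Offset  int64
	Payload []byte
}

// consumeBatch reads up to batchSize rows into memory, closes the result
// set, and only afterwards updates offset_consumed, so no statement is
// executed while sql.Rows is still open.
func consumeBatch(ctx context.Context, tx *sql.Tx, lastOffset int64, batchSize int) ([]storedMessage, error) {
	rows, err := tx.QueryContext(ctx,
		`SELECT "offset", payload FROM watermill_messages WHERE "offset" > $1 ORDER BY "offset" LIMIT $2`,
		lastOffset, batchSize)
	if err != nil {
		return nil, err
	}

	var batch []storedMessage
	for rows.Next() {
		var m storedMessage
		if err := rows.Scan(&m.Offset, &m.Payload); err != nil {
			rows.Close()
			return nil, err
		}
		batch = append(batch, m)
	}
	// Free the connection before issuing any further statements.
	if err := rows.Close(); err != nil {
		return nil, err
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}

	if len(batch) > 0 {
		last := batch[len(batch)-1].Offset
		if _, err := tx.ExecContext(ctx,
			`UPDATE watermill_offsets SET offset_consumed = $1`, last); err != nil {
			return nil, err
		}
	}
	return batch, nil
}
```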
I think that, in theory, it should help to achieve some performance gains. The downside is that we need to accept losing exactly-once delivery. But I wouldn't expect Kafka-like performance - that was not a goal of this Pub/Sub :wink:
What I can recommend is to do some TDD and experiment with reading more rows at once and committing in a separate transaction. By making small changes and trusting the tests, you should be able to develop something that works :) Please keep in mind that you will need to change the test features a bit: https://github.com/ThreeDotsLabs/watermill-sql/blob/master/pkg/sql/pubsub_test.go#L175 (due to the loss of exactly-once delivery, for example).
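If I read the linked test file correctly, the supported guarantees are declared as a tests.Features struct; the adjustment would then look roughly like this (the import path and field names are assumptions from memory, check pubsub_test.go for the real ones):

```go
package main

import "github.com/ThreeDotsLabs/watermill/pubsub/tests"

// Batch reading with the offset committed in a separate transaction gives
// up exactly-once delivery, so the test suite must stop asserting it.
var features = tests.Features{
	ExactlyOnceDelivery: false,
	GuaranteedOrder:     true,
	Persistent:          true,
}
```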
@roblaszczak @m110 Some thoughts based on my current experience with pub/subs :)
One of the arguments for batch reading is that it enables batch processing. In the case of the outbox pattern, we route messages to some broker strictly one by one, and that can be a bottleneck. With batch reading, we could publish the messages of a batch concurrently: e.g. read 100 messages and publish them with 100 goroutines (relatively speaking), as in the sketch below.
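Relatively speaking, the fan-out could look like this; `message.Publisher` is the standard watermill interface, and the error handling is simplified for illustration:

```go
package main

import (
	"sync"

	"github.com/ThreeDotsLabs/watermill/message"
)

// publishConcurrently forwards one batch to the target broker with one
// goroutine per message, keeping only the first error. Real code would
// also decide which offsets are safe to ack afterwards.
func publishConcurrently(pub message.Publisher, topic string, batch []*message.Message) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, msg := range batch {
		wg.Add(1)
		go func(m *message.Message) {
			defer wg.Done()
			if err := pub.Publish(topic, m); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(msg)
	}
	wg.Wait()
	return firstErr
}
```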
Of course, one-by-one processing provides delivery order by design. But I think that in practice you need guaranteed message ordering only in certain cases; most use cases don't care about it.
In cases where we really do need message ordering in SQL, it could be offered as a dedicated feature, like Google Pub/Sub message ordering: something like an ordering_key in the table schema / message could make it possible to publish messages sequentially within the group sharing the same ordering_key.
For example, we read a batch of messages:

offset  ordering_key
1       NULL
2       NULL
3       "key1"
4       "key2"
5       "key1"
6       "key2"
7       NULL
so we could send the messages concurrently as [1], [2], [7], [3,5], [4,6],
where [3,5] and [4,6] are the messages grouped by ordering_key that must be sent sequentially (first 3, then 5, and so on).
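A sketch of that grouping, assuming the ordering key travels in the message metadata under a hypothetical `ordering_key` field (watermill-sql has no such column today); messages with an empty key behave like the NULL rows above:

```go
package main

import "github.com/ThreeDotsLabs/watermill/message"

// orderingKeyMetadata is an assumed metadata field holding the ordering key.
const orderingKeyMetadata = "ordering_key"

// groupByOrderingKey splits a batch into groups that may be published
// concurrently with each other, while messages inside a group keep their
// relative order. For the example above it yields [1], [2], [7], [3,5], [4,6].
func groupByOrderingKey(batch []*message.Message) [][]*message.Message {
	var groups [][]*message.Message
	keyed := map[string]int{} // ordering key -> index into groups

	for _, msg := range batch {
		key := msg.Metadata.Get(orderingKeyMetadata)
		if key == "" {
			// No ordering key: the message forms its own group.
			groups = append(groups, []*message.Message{msg})
			continue
		}
		i, ok := keyed[key]
		if !ok {
			i = len(groups)
			keyed[key] = i
			groups = append(groups, nil)
		}
		groups[i] = append(groups[i], msg)
	}
	return groups
}
```

Each group could then be handed to its own goroutine, which publishes the group's messages sequentially, producing exactly the [1], [2], [7], [3,5], [4,6] schedule from the example.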
Batch processing in general has downsides, like partial success and atomicity (when only part of the batch was processed):

offset  ordering_key  result
1       NULL          sent
2       NULL          failed
3       "key1"        sent
4       "key2"        sent
5       "key1"        sent
6       "key2"        sent
7       NULL          sent

Here offset 2 failed, so the consumer's ack stops at that point and the other messages (3-7) will be redelivered on the next iteration in the current implementation. But I think that with "at-least-once" semantics (which are fundamental for pub/subs) this should not be a problem; one possible acking rule is sketched below.
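One way to handle the partial failure: acknowledge only the offsets before the first failed one and let at-least-once redelivery cover the rest. The function and its semantics are my assumption, not the current watermill-sql behaviour:

```go
package main

import "sort"

// ackableOffset returns the highest offset that is safe to acknowledge
// when only part of a batch was published: everything before the first
// failed offset. For the table above (offset 2 failed) it returns 1, so
// offset 2 and the already-sent offsets 3-7 are redelivered, which
// at-least-once semantics permit.
func ackableOffset(offsets []int64, failed map[int64]bool, lastAcked int64) int64 {
	sort.Slice(offsets, func(i, j int) bool { return offsets[i] < offsets[j] })
	acked := lastAcked
	for _, o := range offsets {
		if failed[o] {
			break
		}
		acked = o
	}
	return acked
}
```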
As part of fixing https://github.com/ThreeDotsLabs/watermill/issues/311, I added experimental support for batch reading for Postgres here: https://github.com/ThreeDotsLabs/watermill-sql/pull/22 (warning: this is a proof of concept so far).