
Allow batch reading in watermill-sql

Open breml opened this issue 4 years ago • 3 comments

The current implementation of watermill-sql reads exactly one message at a time (PostgreSQL implementation: https://github.com/ThreeDotsLabs/watermill-sql/blob/master/pkg/sql/schema_adapter_postgresql.go#L66).

I expect a significant subscriber performance improvement if the subscriber could read messages in batches of configurable size, e.g. 10 or 100 items per SELECT.

My question is whether there are any conceptual arguments against this idea.

breml avatar Feb 18 '21 07:02 breml

Following @breml, I've got a similar question about the MySQL implementation: https://github.com/ThreeDotsLabs/watermill-sql/blob/a2768559a9c416c1d8b5fe506401dc51639abb63/pkg/sql/subscriber.go#L249 Is it possible to optimise the reads?

screwyprof avatar Mar 02 '21 02:03 screwyprof

Today I did some initial tests with batch reading, and it is not that easy (and I am no longer sure it will bring a performance gain).

The problem is: there are known issues with Go's database/sql and the postgres driver when trying to execute additional database queries while still reading a result set (sql.Rows). This confuses the postgres driver, and it returns errors instead of executing the queries.

The workaround to the above problem would be to first read all rows (e.g. 100) into memory, close the sql.Rows, and only then update the offset_consumed field in the DB. I did not implement or test this, because I am no longer convinced that this will bring the performance gain I was hoping for.
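The drain-then-commit workaround could be sketched roughly like this (a minimal sketch; `rowIter`, `Message`, and `drainBatch` are hypothetical names, and the interface stands in for `*sql.Rows` only so the logic can be shown without a live database):

```go
package main

import "fmt"

// rowIter abstracts *sql.Rows; in real code you would pass *sql.Rows directly.
type rowIter interface {
	Next() bool
	Scan(dest ...interface{}) error
	Err() error
	Close() error
}

// Message is a minimal stand-in for a watermill-sql message row.
type Message struct {
	Offset  int64
	Payload string
}

// drainBatch reads every row into memory and closes the result set before
// returning, so the caller can run the offset_consumed UPDATE afterwards
// without hitting the database/sql + postgres-driver restriction on issuing
// queries while sql.Rows is still open.
func drainBatch(rows rowIter) ([]Message, error) {
	defer rows.Close()
	var batch []Message
	for rows.Next() {
		var m Message
		if err := rows.Scan(&m.Offset, &m.Payload); err != nil {
			return nil, err
		}
		batch = append(batch, m)
	}
	return batch, rows.Err()
}

// fakeRows simulates a result set for demonstration purposes.
type fakeRows struct {
	i      int
	data   []Message
	closed bool
}

func (f *fakeRows) Next() bool { f.i++; return f.i <= len(f.data) }
func (f *fakeRows) Scan(dest ...interface{}) error {
	m := f.data[f.i-1]
	*dest[0].(*int64) = m.Offset
	*dest[1].(*string) = m.Payload
	return nil
}
func (f *fakeRows) Err() error   { return nil }
func (f *fakeRows) Close() error { f.closed = true; return nil }

func main() {
	rows := &fakeRows{data: []Message{{1, "a"}, {2, "b"}, {3, "c"}}}
	batch, _ := drainBatch(rows)
	// prints: drained 3 messages, rows closed: true
	fmt.Printf("drained %d messages, rows closed: %v\n", len(batch), rows.closed)
}
```

After `drainBatch` returns, the connection is free again, so a subsequent `UPDATE ... SET offset_consumed = $1` on the same transaction should no longer confuse the driver.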

breml avatar Mar 04 '21 14:03 breml

> The workaround to the above problem would be to first read all rows (e.g. 100) into memory, close the sql.Rows, and only then update the offset_consumed field in the DB. I did not implement or test this, because I am no longer convinced that this will bring the performance gain I was hoping for.

I think that, in theory, it should help achieve some performance gains. The downside is that we need to accept losing exactly-once delivery. But I'd not expect Kafka-like performance - that was not a goal of this Pub/Sub :wink:

What I can recommend is to do some TDD and experiment with reading more rows at once and committing in a separate transaction. By making small changes and trusting the tests, you should be able to develop something that works :) Please keep in mind that you will need to change the test features a bit: https://github.com/ThreeDotsLabs/watermill-sql/blob/master/pkg/sql/pubsub_test.go#L175 (due to the loss of exactly-once delivery, for example).

roblaszczak avatar Apr 04 '21 08:04 roblaszczak

@roblaszczak @m110 Some thoughts based on my current experience with pub/subs :)

One of the arguments for batch reading is that it enables batch processing. In the case of the outbox pattern, we route messages to the broker strictly one by one, which can be a bottleneck. With batch reading, we could publish messages concurrently within the scope of a batch, e.g. read 100 messages and publish them with 100 goroutines (relatively speaking) concurrently.

Of course, one-by-one processing provides delivery order by design. But in practice you may need guaranteed message ordering only in certain cases; most use cases don't depend on it.

In cases where we really do need message ordering in SQL, it could be treated as a special feature, like Google Pub/Sub's message ordering: an ordering_key in the table schema / message could make it possible to publish messages sequentially within each group sharing the same ordering_key.

For example, we read a batch of messages:

```
offset  ordering_key
1       NULL
2       NULL
3       "key1"
4       "key2"
5       "key1"
6       "key2"
7       NULL
```

so we could send messages concurrently: [1], [2], [7], [3,5], [4,6], where [3,5] and [4,6] are groups of messages sharing an ordering_key that must be sent sequentially (first 3, then 5, and so on).
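The grouping described above could be sketched like this (a hypothetical helper, not existing watermill-sql API; an empty `Key` plays the role of NULL):

```go
package main

import (
	"fmt"
	"sync"
)

// Msg is a hypothetical batched message with an optional ordering key.
type Msg struct {
	Offset int64
	Key    string
}

// groupByOrderingKey splits a batch into publishable groups: each keyless
// message forms its own group, while messages sharing an ordering_key stay
// together in batch (offset) order.
func groupByOrderingKey(batch []Msg) [][]Msg {
	var groups [][]Msg
	keyed := map[string]int{} // ordering_key -> index into groups
	for _, m := range batch {
		if m.Key == "" {
			groups = append(groups, []Msg{m})
			continue
		}
		if i, ok := keyed[m.Key]; ok {
			groups[i] = append(groups[i], m)
		} else {
			keyed[m.Key] = len(groups)
			groups = append(groups, []Msg{m})
		}
	}
	return groups
}

func main() {
	batch := []Msg{{1, ""}, {2, ""}, {3, "key1"}, {4, "key2"}, {5, "key1"}, {6, "key2"}, {7, ""}}
	groups := groupByOrderingKey(batch)

	var wg sync.WaitGroup
	for _, g := range groups {
		wg.Add(1)
		go func(g []Msg) { // groups publish concurrently...
			defer wg.Done()
			for range g { // ...but sequentially within each group
				// publish(m) would go here
			}
		}(g)
	}
	wg.Wait()
	fmt.Println(len(groups)) // prints 5: the groups [1] [2] [3,5] [4,6] [7]
}
```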

Batch processing in general has downsides, like partial success and loss of atomicity (when only part of the batch was processed):

```
offset  ordering_key  result
1       NULL          sent
2       NULL          failed  <-- becomes the actual acked offset for the consumer, so messages 3-7 will be redelivered on the next iteration in the current implementation
3       "key1"        sent
4       "key2"        sent
5       "key1"        sent
6       "key2"        sent
7       NULL          sent
```

but I think that under "at-least-once" semantics (which are fundamental for pub/subs) this should not be a problem.
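The redelivery behaviour above can be expressed as a small helper that computes the highest safely-ackable offset after a partially failed batch (a hypothetical helper, not watermill-sql API):

```go
package main

import "fmt"

// ackableOffset returns the highest offset that can be acked after a batch
// attempt: the last offset before the first failure. Messages sent after a
// failure are redelivered on the next read, which is acceptable under
// at-least-once semantics. offsets and sentOK are indexed in batch order.
func ackableOffset(offsets []int64, sentOK []bool, lastAcked int64) int64 {
	acked := lastAcked
	for i, off := range offsets {
		if !sentOK[i] {
			return acked // the first failure stops the ack watermark
		}
		acked = off
	}
	return acked
}

func main() {
	offsets := []int64{1, 2, 3, 4, 5, 6, 7}
	sentOK := []bool{true, false, true, true, true, true, true}
	fmt.Println(ackableOffset(offsets, sentOK, 0)) // prints 1: offset 2 failed, so 3-7 get redelivered
}
```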

condorcet avatar Mar 01 '23 20:03 condorcet

As part of fixing https://github.com/ThreeDotsLabs/watermill/issues/311, I added experimental support for batch reading for Postgres here: https://github.com/ThreeDotsLabs/watermill-sql/pull/22 (warning: this is a proof of concept, so far).

roblaszczak avatar Apr 06 '23 18:04 roblaszczak