r2dbc-postgresql
Inefficient statement-level batching
Using Statement.add() and bind to execute batches is significantly slower than using Connection.createBatch(). Using bindings should be faster than recreating individual textual SQL statements and sending them over the connection.
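For concreteness, a minimal sketch of the two approaches (table x and the values are illustrative; getRowsUpdated returns Publisher<Long> as of R2DBC SPI 1.0):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import reactor.core.publisher.Flux;

// Statement-level batching: parametrized SQL, one set of bindings per row.
static Flux<Long> statementBatch(Connection connection) {
    return Flux.from(connection
                    .createStatement("INSERT INTO x VALUES ($1)")
                    .bind("$1", 1)
                    .add()
                    .bind("$1", 2)
                    .execute())
            .concatMap(Result::getRowsUpdated);
}

// Connection-level batching: plain SQL text, no bind parameters.
static Flux<Long> connectionBatch(Connection connection) {
    return Flux.from(connection
                    .createBatch()
                    .add("INSERT INTO x VALUES (1)")
                    .add("INSERT INTO x VALUES (2)")
                    .execute())
            .concatMap(Result::getRowsUpdated);
}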
As far as I have debugged it with Wireshark, it looks like Connection.createBatch will send all the (mostly duplicated) SQL statements in a single packet, back-to-back. Statement batches, on the other hand, do not send all of the bind/execute etc. commands in one go; each statement is run one by one. Compare this to how JDBC statement-level batching works, where all the bind and execute commands are also sent back-to-back in one go. For this reason (though not only this one), batched JDBC statements are also quite a bit faster than using r2dbc.
(In https://github.com/akka/akka-persistence-r2dbc/commit/2ea412112debf7bb6a1ef75a9ac68cfe1bbe0fec we were considering actually moving to connection-level batching, but issuing statements without bind (i.e. creating individual textual SQL statements) seems too big of a risk / anti-pattern just to work around this issue in the driver.)
(I can try to create another reproduction including Wireshark output if you need it, but it should be pretty easy to reproduce.)
I think it is running the bindings one-by-one here: https://github.com/pgjdbc/r2dbc-postgresql/blob/a32a679bc196c872d0929a1fd157cccfb2dd9cf3/src/main/java/io/r2dbc/postgresql/PostgresqlStatement.java#L223-L229
The performance you experience depends pretty much on the input variance. Parametrized queries use the extended flow to prepare a query on the server side and then run the prepared query with the bindings provided to the statement. Queries that do not change and are called with the same parameter types should typically yield a performance profile similar to that of inline-parametrized (static) statements.
The catch with Statement.add(…) is that each Statement may result in a number of results that are fetched individually (open, consume, close cursor). Therefore, we cannot pipeline server requests and send all bindings at once, but must resort to one-by-one processing. On the other hand, Batch collects all statements and sends the entire SQL text (;-concatenated SQL) to the server which is just one call instead of a multi-roundtrip message exchange.
I'm actually wondering why the JDBC driver would yield a better performance profile as it should work exactly the same way. Both drivers have the same amount of information and should be able to have a similar throughput. Maybe @davecramer has some more insights.
The catch with Statement.add(…) is that each Statement may result in a number of results that are fetched individually (open, consume, close cursor).
Good point. I guess batch statements mostly make sense for updates where you don't care about the result, so that might be the difference? Also, there's a difference in the API: JDBC only allows consuming the data from the first statement, while r2dbc seems to allow consumption of all results?
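For reference, a sketch of what consuming every per-statement result looks like on the R2DBC side (table, column names and values are illustrative):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import java.util.List;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Each binding added via add() produces its own Result, and each Result can
// be consumed individually, unlike a JDBC batch, which only reports counts.
static Mono<List<Long>> allUpdateCounts(Connection connection) {
    return Flux.from(connection
                    .createStatement("UPDATE x SET v = $1 WHERE id = $2")
                    .bind("$1", 10).bind("$2", 1)
                    .add()
                    .bind("$1", 20).bind("$2", 2)
                    .execute())                 // one Result per binding
            .concatMap(Result::getRowsUpdated)  // consume each result
            .collectList();                     // e.g. [1, 1]
}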
I'm actually wondering why the JDBC driver would yield a better performance profile as it should work exactly the same way. Both drivers have the same amount of information and should be able to have a similar throughput. Maybe @davecramer has some more insights.
I haven't looked at this in a while. But if @jrudolph is correct, we send all the requests at once, which minimizes the round trips.
r2dbc seems to allow consumption of all results?
Yes, this is true. I need to think about proposals on how we could optimize here.
Right now, I can imagine that pipelining would work if the fetch size is zero (fetch everything), so we can request all data at once (send all the requests within a small number of packets) without employing a multi-round-trip conversation that takes backpressure into account.
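The fetch-size knob itself already exists on the SPI; the pipelining is the proposal above, not current driver behavior. A sketch, assuming fetchSize(0) means fetch everything as described:

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import reactor.core.publisher.Flux;

static Flux<Long> fetchAllBatch(Connection connection) {
    return Flux.from(connection
                    .createStatement("INSERT INTO x VALUES ($1)")
                    .bind("$1", 1).add()
                    .bind("$1", 2)
                    .fetchSize(0)               // 0 = no cursor paging, fetch everything
                    .execute())
            .concatMap(Result::getRowsUpdated);
}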
I need to think about proposals on how we could optimize here.
If it matters: in my opinion it's most important to optimize this for batch inserts. For inserts, you don't care about the result value; I guess it's always 1. If one statement fails, the entire batch (transaction) fails.
What we (pgjdbc) do is rewrite the inserts into one insert using multi-value writes, i.e.
insert into x values (1)
insert into x values (2)
becomes
insert into x values (1),(2)
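For comparison, this rewrite is enabled by the documented pgjdbc connection property reWriteBatchedInserts; a sketch of a batch it applies to (URL and credentials are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;

static void rewrittenBatchInsert() throws Exception {
    Properties props = new Properties();
    props.setProperty("user", "postgres");               // illustrative credentials
    props.setProperty("password", "postgres");
    props.setProperty("reWriteBatchedInserts", "true");  // enables the rewrite
    try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/test", props);
         PreparedStatement ps = conn.prepareStatement("insert into x values (?)")) {
        ps.setInt(1, 1);
        ps.addBatch();
        ps.setInt(1, 2);
        ps.addBatch();
        ps.executeBatch(); // the driver sends one multi-value insert, as described above
    }
}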
Interesting. R2DBC drivers refrain from rewriting SQL in any form as we do not want to get ourselves into SQL parsing (sigh) or rewriting business. In any case, a multi-value insert can be passed into the driver directly because it is SQL after all.
We don't parse it either; this is one case where we rewrite it, though. It can also be rewritten as an array insert: insert into x values (unnest(array[1,2,3,4]));
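A sketch of the unnest variant with a single array bind parameter, assuming x has a single integer column as in the examples above (r2dbc-postgresql maps Integer[] to a Postgres integer array):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import reactor.core.publisher.Flux;

static Flux<Long> unnestInsert(Connection connection) {
    return Flux.from(connection
                    .createStatement("INSERT INTO x SELECT unnest($1)")
                    .bind("$1", new Integer[]{1, 2, 3, 4}) // one parameter, any row count
                    .execute())
            .concatMap(Result::getRowsUpdated);
}

A nice property of this form is that the SQL text stays the same regardless of the row count, which also sidesteps the prepared-statement-caching concern raised below.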
a multi-value insert can be passed into the driver directly because it is SQL after all
I understand the examples above with literal values, but could you give some hints on how to do that with bind parameters? Would it be just one bind with very many bind parameters? If that is the case, the number of bind parameters would also depend on how many rows to insert, which probably isn't great for prepared statement caching?
Or would that be used with only one bind parameter and the array syntax you showed?
connection.createStatement("insert into x values ($1),($2)").bind("$1", …).bind("$2", …).execute() should do.
Prepared statement caching doesn't buy you that much in Postgres. I expect one round trip, which would be more than made up for by doing the inserts all in one statement.
Taking the example from https://www.postgresql.org/docs/current/dml-insert.html
INSERT INTO products (product_no, name, price) VALUES
($1, $2, $3),
($4, $5, $6),
($7, $8, $9);
would require 9 bind parameters for 3 rows, 12 for 4, and so on? Or did I misunderstand the syntax?
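(For reference: yes, three placeholders per row, so the SQL text grows with the row count. An illustrative sketch of generating such a statement; the helper method is not driver API:)

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Statement;

static Statement multiValueInsert(Connection connection, int rows) {
    StringBuilder sql = new StringBuilder(
            "INSERT INTO products (product_no, name, price) VALUES ");
    for (int row = 0; row < rows; row++) {
        int base = row * 3;
        if (row > 0) {
            sql.append(", ");
        }
        sql.append("($").append(base + 1)
           .append(", $").append(base + 2)
           .append(", $").append(base + 3).append(")");
    }
    // rows = 3 yields: … VALUES ($1, $2, $3), ($4, $5, $6), ($7, $8, $9)
    return connection.createStatement(sql.toString());
    // the caller then binds 3 * rows parameters before execute()
}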
Thanks for the hint, but it feels like an extreme workaround for such a common use case as batch inserts. I hope it can be solved in a better way in the driver itself.
Thanks for the hint, but it feels like an extreme workaround for such a common use case as batch inserts. I hope it can be solved in a better way in the driver itself.
which is why the JDBC driver does it for you