r2dbc-postgresql
Inefficient statement-level batching
Using Statement.add() and bind to execute batches is significantly slower than using Connection.createBatch(). Using bindings should be faster than recreating individual textual SQL statements and sending them over the connection.
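For concreteness, a minimal sketch of the two approaches (table x and the values are illustrative; getRowsUpdated returns Publisher<Long> as of R2DBC SPI 1.0):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import reactor.core.publisher.Flux;

// Statement-level batching: parametrized SQL, one set of bindings per row.
static Flux<Long> statementBatch(Connection connection) {
    return Flux.from(connection
                    .createStatement("INSERT INTO x VALUES ($1)")
                    .bind("$1", 1)
                    .add()
                    .bind("$1", 2)
                    .execute())
            .concatMap(Result::getRowsUpdated);
}

// Connection-level batching: plain SQL text, no bind parameters.
static Flux<Long> connectionBatch(Connection connection) {
    return Flux.from(connection
                    .createBatch()
                    .add("INSERT INTO x VALUES (1)")
                    .add("INSERT INTO x VALUES (2)")
                    .execute())
            .concatMap(Result::getRowsUpdated);
}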
As far as I have debugged it with Wireshark, it looks like Connection.createBatch will send all the (mostly duplicated) SQL statements in a single packet, back-to-back. Statement batches, on the other hand, do not send all of the bind/execute etc. commands in one go; each statement is run one by one. Compare this to how JDBC statement-level batching works, where all the bind and execute commands are also sent back-to-back in one go. For this reason (though not only this one), batched JDBC statements are also quite a bit faster than using r2dbc.
(In https://github.com/akka/akka-persistence-r2dbc/commit/2ea412112debf7bb6a1ef75a9ac68cfe1bbe0fec we were considering actually moving to connection-level batching, but issuing statements without bind (i.e. creating individual textual SQL statements) seems too big of a risk / anti-pattern just to work around this issue in the driver.)
(I can try to create another reproduction including Wireshark output if you need it, but it should be pretty easy to reproduce.)
I think it is running the bindings one-by-one here: https://github.com/pgjdbc/r2dbc-postgresql/blob/a32a679bc196c872d0929a1fd157cccfb2dd9cf3/src/main/java/io/r2dbc/postgresql/PostgresqlStatement.java#L223-L229
The performance you experience depends pretty much on the input variance. Parametrized queries use the extended flow to prepare a query on the server side and then run the prepared query with the bindings provided to the statement. Queries that do not change and are called with the same parameter types should typically yield a performance profile similar to that of inline-parametrized (static) statements.
The catch with Statement.add(…) is that each Statement may result in a number of results that are fetched individually (open, consume, close cursor). Therefore, we cannot pipeline server requests and send all bindings at once, but must resort to one-by-one processing. On the other hand, Batch collects all statements and sends the entire SQL text (;-concatenated SQL) to the server which is just one call instead of a multi-roundtrip message exchange.
I'm actually wondering why the JDBC driver would yield a better performance profile as it should work exactly the same way. Both drivers have the same amount of information and should be able to have a similar throughput. Maybe @davecramer has some more insights.
The catch with Statement.add(…) is that each Statement may result in a number of results that are fetched individually (open, consume, close cursor).
Good point. I guess batch statements mostly make sense for updates where you don't care about the result, so that might be the difference? Also, there's a difference in the API: JDBC only allows consuming the data from the first statement, while r2dbc seems to allow consumption of all results?
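For reference, a sketch of what consuming every per-statement result looks like on the R2DBC side (table, column names and values are illustrative):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import java.util.List;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Each binding added via add() produces its own Result, and each Result can
// be consumed individually, unlike a JDBC batch, which only reports counts.
static Mono<List<Long>> allUpdateCounts(Connection connection) {
    return Flux.from(connection
                    .createStatement("UPDATE x SET v = $1 WHERE id = $2")
                    .bind("$1", 10).bind("$2", 1)
                    .add()
                    .bind("$1", 20).bind("$2", 2)
                    .execute())                 // one Result per binding
            .concatMap(Result::getRowsUpdated)  // consume each result
            .collectList();                     // e.g. [1, 1]
}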
I'm actually wondering why the JDBC driver would yield a better performance profile as it should work exactly the same way. Both drivers have the same amount of information and should be able to have a similar throughput. Maybe @davecramer has some more insights.
I haven't looked at this in a while. But if @jrudolph is correct, we send all the requests at once, which minimizes the round trips.
r2dbc seems to allow consumption of all results?
Yes, this is true. I need to think about proposals on how we could optimize here.
Right now, I can imagine that pipelining would work if the fetch size is zero (fetch everything), so we can request all data at once (send all the requests within a small number of packets) without employing a multi-round-trip conversation that takes backpressure into account.
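The fetch-size knob itself already exists on the SPI; the pipelining is the proposal above, not current driver behavior. A sketch, assuming fetchSize(0) means fetch everything as described:

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import reactor.core.publisher.Flux;

static Flux<Long> fetchAllBatch(Connection connection) {
    return Flux.from(connection
                    .createStatement("INSERT INTO x VALUES ($1)")
                    .bind("$1", 1).add()
                    .bind("$1", 2)
                    .fetchSize(0)               // 0 = no cursor paging, fetch everything
                    .execute())
            .concatMap(Result::getRowsUpdated);
}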
I need to think about proposals on how we could optimize here.
If it matters: in my opinion it's most important to optimize this for batch inserts. For inserts, you don't care about the result value; I guess it's always 1. If one statement fails, the entire batch (transaction) fails.
What we (pgjdbc) do is rewrite the inserts into one insert using multi-value writes, i.e.
insert into x values (1)
insert into x values (2)
becomes
insert into x values (1),(2)
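For comparison, this rewrite is enabled by the documented pgjdbc connection property reWriteBatchedInserts; a sketch of a batch it applies to (URL and credentials are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;

static void rewrittenBatchInsert() throws Exception {
    Properties props = new Properties();
    props.setProperty("user", "postgres");               // illustrative credentials
    props.setProperty("password", "postgres");
    props.setProperty("reWriteBatchedInserts", "true");  // enables the rewrite
    try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/test", props);
         PreparedStatement ps = conn.prepareStatement("insert into x values (?)")) {
        ps.setInt(1, 1);
        ps.addBatch();
        ps.setInt(1, 2);
        ps.addBatch();
        ps.executeBatch(); // the driver sends one multi-value insert, as described above
    }
}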
Interesting. R2DBC drivers refrain from rewriting SQL in any form as we do not want to get ourselves into SQL parsing (sigh) or rewriting business. In any case, a multi-value insert can be passed into the driver directly because it is SQL after all.
We don't parse it either; this is one case where we rewrite it, though. It can also be rewritten as an array insert: insert into x values (unnest(array[1,2,3,4]));
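A sketch of the unnest variant with a single array bind parameter, assuming x has a single integer column as in the examples above (r2dbc-postgresql maps Integer[] to a Postgres integer array):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import reactor.core.publisher.Flux;

static Flux<Long> unnestInsert(Connection connection) {
    return Flux.from(connection
                    .createStatement("INSERT INTO x SELECT unnest($1)")
                    .bind("$1", new Integer[]{1, 2, 3, 4}) // one parameter, any row count
                    .execute())
            .concatMap(Result::getRowsUpdated);
}

A nice property of this form is that the SQL text stays the same regardless of the row count, which also sidesteps the prepared-statement-caching concern raised below.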
a multi-value insert can be passed into the driver directly because it is SQL after all
I understand the examples above with literal values, but could you give some hints on how to do that with bind parameters? Would it be just one bind with very many bind parameters? If that is the case, the number of bind parameters would also depend on how many rows to insert, which probably isn't great for prepared statement caching?
Or would that be used with only one bind parameter and the array syntax you showed?
connection.createStatement("insert into x values ($1),($2)").bind("$1", …).bind("$2", …).execute() should do.
Prepared statement caching doesn't buy you that much in Postgres. I expect one round trip, which would be more than made up for by doing the inserts all in one statement.
Taking the example from https://www.postgresql.org/docs/current/dml-insert.html
INSERT INTO products (product_no, name, price) VALUES
($1, $2, $3),
($4, $5, $6),
($7, $8, $9);
would require 9 bind parameters for 3 rows, 12 for 4, and so on? Or did I misunderstand the syntax?
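(For reference: yes, three placeholders per row, so the SQL text grows with the row count. An illustrative sketch of generating such a statement; the helper method is not driver API:)

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Statement;

static Statement multiValueInsert(Connection connection, int rows) {
    StringBuilder sql = new StringBuilder(
            "INSERT INTO products (product_no, name, price) VALUES ");
    for (int row = 0; row < rows; row++) {
        int base = row * 3;
        if (row > 0) {
            sql.append(", ");
        }
        sql.append("($").append(base + 1)
           .append(", $").append(base + 2)
           .append(", $").append(base + 3).append(")");
    }
    // rows = 3 yields: … VALUES ($1, $2, $3), ($4, $5, $6), ($7, $8, $9)
    return connection.createStatement(sql.toString());
    // the caller then binds 3 * rows parameters before execute()
}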
Thanks for the hint, but it feels like an extreme workaround for such a common use case as batch inserts. I hope it can be solved in a better way in the driver itself.
Thanks for the hint, but it feels like an extreme workaround for such a common use case as batch inserts. I hope it can be solved in a better way in the driver itself.
which is why the JDBC driver does it for you