Significant performance regression going from Postgres 13 → 14+ in test code
I've been investigating a big run time regression in our test suite after switching the Postgres version in our docker container from 13 to 16, which I've subsequently narrowed down to happening when going from 13 to 14.
It's been suggested that this might be due to how PostgresNIO is adopting changes in Postgres 14, perhaps "pipelining support".
There are more details in the tracking issue on our end (the one linked above) which I'll briefly summarise here:
- we're seeing a 3x run time regression running our test suite when using a Postgres 14+ container vs Postgres 13
- running a pure SQL test script against different Postgres versions with a copy of our production database shows that Postgres 14+ are in fact faster than 13
- this leads me to believe the problem isn't the Postgres version per se but something higher up the stack
Here are the figures from the issue:
SPI test suite, running locally on a Mac Studio:
- Pg 13: 52s
- Pg 14: 162s
- Pg 16: 169s
There's a clear regression going from Postgres 13 → 14.
This is not the case when running plain SQL against those db versions (with copies of our production data, so it's a real world scenario):
- Pg 13: 1.79s
- Pg 14: 1.65s
- Pg 16: 1.66s
I don't know if or how this could be an issue in PostgresNIO but it doesn't seem to be a db issue. Are there any other tests I could be running to help figure out where the slowdown could be occurring?
@finestructure Can you help me investigate this? Can you provide me a SQL query and a matching database that becomes slower with Postgres 14?
I'll try and come up with something that involves less of our stack (and who knows, maybe I'll find the cause isn't PostgresNIO in the process).
So the good news is that a perf test I created here: https://github.com/finestructure/pg-perf-regression.git does not show any performance regression in PostgresNIO.
I've got more details in the original issue.
As I conclude there, I think that leaves us with two possible explanations for the x3 performance regression:
- The SQL test load is significantly different from what we do in our tests, or at the very least we do something else that is not captured by my SQL load test.
- There's an issue higher up the stack in Vapor or in our code that behaves dramatically differently depending on what Postgres version is being targeted.
I'll close this issue for now while I try to figure out which it is - or if it's something else entirely.
Any ideas what else I could try greatly appreciated!
I believe I know why I couldn't reproduce the issue at first. I've got some details here and will try to create a reproducer that better reflects what we're doing.
Some details here: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/3360#issuecomment-2348434200
I'll start by saying that this issue is quite bizarre. There seem to be multiple factors at play. For one, the issue seems to happen when running in an XCTest. That's why I couldn't reproduce it standalone at first.
There are also varying degrees of slowdown in an SQL query depending on how many times another SQL query ran before it.
Best I can tell this is not a measurement effect. I see this same slowdown in our actual tests.
I've set up a pure PostgresNIO project here with my tests that resemble our unit tests: https://github.com/finestructure/pg-perf-regression.git
Simply running
./run-perf-test.sh 13 test1
./run-perf-test.sh 13 test2
./run-perf-test.sh 13 test3
./run-perf-test.sh 14 test1
./run-perf-test.sh 14 test2
./run-perf-test.sh 14 test3
should yield all the figures. I've plotted my results here https://www.icloud.com/numbers/0945R88oKsGdVM5kY89hT7eMA#Issue_3360_perf_regression and am attaching screenshots below for easier viewing.
The core of the test is this bit here:
// test1
try await recreateDatabase(testDbName)
try await recreateDatabase(testDbName)
try await createSnapshot(original: testDbName, snapshot: snapshotName)
// test2
try await recreateDatabase(testDbName)
try await createSnapshot(original: testDbName, snapshot: snapshotName)
// test3
try await createSnapshot(original: testDbName, snapshot: snapshotName)
The timing is only done within createSnapshot.
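For anyone who'd rather not open the repo, the timed helper boils down to something like the sketch below. This is a simplified reconstruction, not the exact code in the repo, and the identifier interpolation via `PostgresQuery(unsafeSQL:)` is just one way to do it (database names can't be bound as query parameters):

```swift
import PostgresNIO
import Logging

// Simplified sketch of the timed helper (not the exact code in the repo).
// Database names can't be bound as parameters, which is why the SQL is
// assembled via PostgresQuery(unsafeSQL:).
func createSnapshot(on connection: PostgresConnection,
                    original: String,
                    snapshot: String,
                    logger: Logger) async throws -> Duration {
    let clock = ContinuousClock()
    let start = clock.now
    try await connection.query(
        PostgresQuery(unsafeSQL: #"DROP DATABASE IF EXISTS "\#(snapshot)" WITH (FORCE)"#),
        logger: logger
    )
    try await connection.query(
        PostgresQuery(unsafeSQL: #"CREATE DATABASE "\#(snapshot)" TEMPLATE "\#(original)""#),
        logger: logger
    )
    // Only the two snapshot statements are inside the measured window.
    return start.duration(to: clock.now)
}
```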
Observations are:
- test1 runs in ~9ms on Pg13 vs ~60ms on Pg14 -> 6.6x slower
- test2 runs in ~16ms on Pg13 vs ~60ms on Pg14 -> 3.9x slower
- test3 runs in ~27ms on Pg13 vs ~67ms on Pg14 -> 2.5x slower
Please double check my work here! While I've run these tests and variants hundreds of times now, and have found them reproducible, it's easy to get stuck in a groove and overlook something obvious. I think I'm testing the right thing and in a sensible fashion, and it does reflect our real-world observations, but it could still well be that I'm timing the wrong thing or something like that.
It's important to note that this is not a micro-benchmark regressing. We're seeing a large (3x) real-world increase in our test runtime, to the degree that we keep testing with Pg 13 locally for now.
The numbers change a little but show generally the same trends when running in release mode:
- test1 2.4x slower 13 → 14
- test2 1.8x slower 13 → 14
- test3 1.3x slower 13 → 14
I've added the figures and graphs in new tabs in the numbers document: https://www.icloud.com/numbers/0945R88oKsGdVM5kY89hT7eMA#Issue_3360_perf_regression
There was a change going from Postgres 13 to 14 in vacuum's default for cleaning up indexes (from ON to AUTO), detailed here, which may be affecting your results. Check to see if the prescribed solution of adding INDEX_CLEANUP ON to the vacuum command helps.
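In case it's useful, the suggestion amounts to something like this when run through PostgresNIO (the table name is just a placeholder):

```swift
// Hypothetical table name; VACUUM can't run inside a transaction block,
// so it has to be issued as a standalone statement.
try await connection.query(
    PostgresQuery(unsafeSQL: "VACUUM (INDEX_CLEANUP ON) packages"),
    logger: logger
)
```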
The latest test is essentially just running
DROP DATABASE IF EXISTS \(databaseName) WITH (FORCE)
CREATE DATABASE \(databaseName)
followed by
DROP DATABASE IF EXISTS \(snapshot) WITH (FORCE)
CREATE DATABASE \(snapshot) TEMPLATE \(original)
I wouldn't expect changes to VACUUM to have any effect here?
Fair enough. I wouldn't expect dropping and creating the database to be affected by vacuum.
So the plot thickens...
Just to follow up on discussions we had at the conference, @fabianfett, and to recap where we're at with this rather bizarre and confusing issue!
- I've re-run the tests on another machine and the figures hold. (Please give it a quick try: there are no dependencies required other than docker, it's just `git clone https://github.com/finestructure/pg-perf-regression.git` and `./run-perf-test.sh 13 test1`!)
- The slow-down is only happening in the context of running tests, it seems. Or at least I've not found any other way to demonstrate the slow-downs and we originally saw them in our tests. There's no indication it's slowing down production use of PgNIO.
- It's also slowing down to the same degree when I convert the XCTest to the new Swift Testing framework. (See the `swift-testing` branch in the `pg-perf-regression` repo.)
Hope that helps tracking this down! Short of converting our tests to run in parallel (which is tricky when running against db instances), I don't see a way for us to "fix" this regression on our end.
Hi Sven :wave: thanks for digging so deep into this issue. I brought this up with my team.
The slow-down is only happening in the context of running tests, it seems. Or at least I've not found any other way to demonstrate the slow-downs and we originally saw them in our tests. There's no indication it's slowing down production use of PgNIO.
As this only affects test code, we don't plan to investigate this issue in the near future, as we currently have higher priority work that we need to handle.
Of course PRs from the community are always welcome and we'd be happy to review those, if they solve this issue.
As this only affects test code
I just want to stress that my tests don't allow that conclusion. Tests are where we first saw the regression, and they're the only place I've been able to demonstrate it so far, but that doesn't mean it only affects tests.
From the investigations I did over on the Vapor Discord server:
The solution is to use these env vars for the docker container to initialize the DB, so things are consistent across different pg versions:
-e POSTGRES_HOST_AUTH_METHOD=md5 \
-e POSTGRES_INITDB_ARGS="--auth-host=md5" \
Or, for test-only purposes, use trust (no password) instead of md5.
For reference: https://hub.docker.com/_/postgres
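For completeness: nothing needs to change on the PostgresNIO side between auth methods. A minimal connection sketch (host and credentials are placeholders) looks the same whether the server ends up negotiating trust, md5, or scram-sha-256; the server's configuration alone decides which handshake runs:

```swift
import PostgresNIO
import NIOPosix
import Logging

// Placeholder credentials for illustration. The server's auth configuration
// (trust / md5 / scram-sha-256) decides which handshake PostgresNIO performs;
// the client code itself is identical in all three cases.
let logger = Logger(label: "pg-perf-test")
let config = PostgresConnection.Configuration(
    host: "localhost",
    port: 5432,
    username: "postgres",
    password: "password",
    database: "postgres",
    tls: .disable
)
let connection = try await PostgresConnection.connect(
    on: MultiThreadedEventLoopGroup.singleton.any(),
    configuration: config,
    id: 1,
    logger: logger
)
try await connection.close()
```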
See POSTGRES_HOST_AUTH_METHOD for why the slowness starts on pg 14. It says:
This optional variable can be used to control the auth-method for host connections for all databases, all users, and all addresses. If unspecified then scram-sha-256 password authentication is used (in 14+; md5 in older releases).
This is at least true for the test case that @finestructure has put together.
"Proof" images; 8ms vs 34ms from start till getting connection properties. postgres 13 md5 vs postgres 14 scram-sha-256:
While SASL being slower than md5 is not particularly a fault of postgres-nio, @fabianfett confirmed that the SASL implementation has suboptimal performance, and that's why he has put together this PR: #553
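For context on why SCRAM is inherently more expensive than md5, independent of any postgres-nio inefficiency: besides the extra round trip, the client has to derive SaltedPassword = Hi(password, salt, i), an iterated HMAC-SHA-256 chain that Postgres runs with 4096 iterations by default. A rough illustrative sketch (not postgres-nio's actual SASL code) using swift-crypto:

```swift
import Crypto
import Foundation

// Illustrative only - not postgres-nio's actual SASL code. Hi() from RFC 5802:
// U1 = HMAC(password, salt || INT(1)), U_k = HMAC(password, U_{k-1}),
// Hi = U1 XOR U2 XOR ... XOR U_i. Postgres defaults to 4096 iterations, so
// every SCRAM login costs thousands of HMAC-SHA-256 invocations, whereas
// md5 auth only needs a couple of MD5 hashes.
func hi(password: String, salt: Data, iterations: Int) -> [UInt8] {
    let key = SymmetricKey(data: Data(password.utf8))
    var u = Array(HMAC<SHA256>.authenticationCode(for: salt + Data([0, 0, 0, 1]), using: key))
    var result = u
    for _ in 1..<iterations {
        u = Array(HMAC<SHA256>.authenticationCode(for: Data(u), using: key))
        for i in result.indices { result[i] ^= u[i] }
    }
    return result
}

// e.g. hi(password: "password", salt: Data(repeating: 0, count: 16), iterations: 4096)
```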
Making SASL much faster in #554.
Based on my benchmarks, @fabianfett's PRs have made SASL super fast - almost as fast as md5, even though SASL requires a second roundtrip. So that problem can be considered solved. Here are the benchmarks (100 rounds of test1 per configuration; times in ms):

| PG version | Auth method | Total (ms) | Avg per run (ms) |
|---|---|---:|---:|
| 13 | trust | 2399.82 | 23.998 |
| 14 | trust | 2068.05 | 20.680 |
| 15 | trust | 2655.48 | 26.554 |
| 16 | trust | 2832.03 | 28.320 |
| 17 | trust | 2706.35 | 27.063 |
| 13 | md5 | 2226.02 | 22.260 |
| 14 | md5 | 2272.17 | 22.721 |
| 15 | md5 | 3099.04 | 30.990 |
| 16 | md5 | 2999.69 | 29.996 |
| 17 | md5 | 2970.74 | 29.707 |
| 13 | scram-sha-256 | 2604.93 | 26.049 |
| 14 | scram-sha-256 | 2647.69 | 26.476 |
| 15 | scram-sha-256 | 3384.28 | 33.842 |
| 16 | scram-sha-256 | 3329.16 | 33.291 |
| 17 | scram-sha-256 | 3473.51 | 34.735 |
As you might notice, pg 13 and 14 basically have the same speed, and scram-sha-256 is only moderately slower than md5 or trust.
A new issue is that postgres 15+ is slower than 13 and 14. Based on my investigations this has nothing to do with postgres-nio. The database is simply responding slower:
pg-14 taking 37ms for a connection start to end:
pg-15 taking 50ms for a connection start to end:
As you can see, the slowness is in the last packet, which is an incoming packet from the db to postgres-nio. So it's unlikely that pg-nio has any fault here.
@MahdiBM thank you for your investigation here!
@finestructure do you think we can close this as 1.26.1 has been released?
In terms of our initial problem I can confirm that simply running the tests with PG 14+ and md5 auth fixes the regression we've been seeing:
previous pg 13 (which used md5 by default)
::notice ::Test run with 795 tests passed after 7.652 seconds
::notice ::Test run with 795 tests passed after 9.245 seconds
::notice ::Test run with 795 tests passed after 8.898 seconds
with pg16 + md5
::notice ::Test run with 795 tests passed after 9.341 seconds
::notice ::Test run with 795 tests passed after 9.090 seconds
::notice ::Test run with 795 tests passed after 6.745 seconds
(The tests are much faster in general than when this was originally reported, because we've moved to parallel testing in the meantime, in large part due to this regression. Good changes all around!)
Happy for this to be closed, although the newly discovered PG 15+ issue might be something to track elsewhere if not here?
Huge thanks to @MahdiBM for your help tracking this down!
@finestructure I'll be posting about the PG 15+ issue on the Postgres mailing lists, and I'll report back. Though this issue can be closed since that doesn't seem to be pg-nio related.
The original premise of this ticket was to track the hard perf cliff between Postgres 13 and 14. We tracked it down to the different default auth mechanism (md5 vs. SASL). In #554 we added significant perf improvements to our SASL handshake.
Let's close this issue. @MahdiBM and @finestructure please open a follow up issue for the PG15+ regression, if you want to.
Just to completely close out the issue, the remaining performance regression from 14 → 15 was due to a change in database creation strategy in 15. I've added more details in the original issue on our end.
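(For reference: Postgres 15 switched the default CREATE DATABASE strategy from FILE_COPY to WAL_LOG.) If anyone runs into the same thing, the pre-15 behaviour can be requested explicitly when creating the snapshot database, along the lines of this sketch (same identifier-interpolation assumption as the earlier createSnapshot sketch; whether it recovers the old timings will depend on your setup):

```swift
// Postgres 15+ only: earlier versions don't accept the STRATEGY option.
try await connection.query(
    PostgresQuery(unsafeSQL: #"CREATE DATABASE "\#(snapshot)" TEMPLATE "\#(original)" STRATEGY = FILE_COPY"#),
    logger: logger
)
```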