pgx icon indicating copy to clipboard operation
pgx copied to clipboard

Reduce network bandwidth with better statement names

Open mchandler-plato opened this issue 2 months ago • 6 comments

At Plato we have a lot of micro services that have about a dozen sql queries each and i noticed two things looking at a wireshark tcp dump of pgx:

  1. if its calling a function, the statement name pgx creates can actually be longer than the original sql
  2. the statement name includes the full hash of the sql which ends up being pretty long

The reason for investigating was a change in logic (with same request rate) dropped network traffic to postgres by 3x which was surprising and I was trying to understand what was being sent on the wire.


For 1: Would it make sense to skip a prepared statement if the sql is just a call to a function such as:

SELECT my_func();
or
SELECT * FROM my_func();

instead of sending the statement name each time:

stmtcache_88f8794f3ae768b41c7fa5f7902be85a93ac6b97bab5cc06

For 2: Can we make the stmtcache injectable via config (or maybe a sql => name mapper) so that smaller names can be used and save the network bytes.

i.e. from:

stmtcache_88f8794f3ae768b41c7fa5f7902be85a93ac6b97bab5cc06

to logic using an incrementing number like:

s1

I noticed that the logic in StatementName method is almost duplicated in the conn class here though not sure what the difference is https://github.com/jackc/pgx/blob/e4a063e7cb0fe9dd022fb1171d7655f6e837c98e/conn.go#L340

Also I noticed that all the column names are returned when preparing the statement and not sure if that is something that can be skipped or what it is used for.

mchandler-plato avatar Oct 24 '25 01:10 mchandler-plato

Do you have any benchmarks showing how much difference this would make? To be honest, I would be shocked if the length of the name of the prepared statement had a measurable difference. As long as total message to PostgreSQL fits in a single packet I wouldn't expect it to make much real world difference even if the bytes over the wire is significantly different.

Would it make sense to skip a prepared statement if the sql is just a call to a function such as:

My assumption is it would make no difference.

Can we make the stmtcache injectable via config (or maybe a sql => name mapper) so that smaller names can be used and save the network bytes.

Again, my assumption is it would make to meaningful difference. If you can provide a benchmark where this makes a difference then I would be open to making a change.

FYI, in older versions of pgx cached statement names were generated by incrementing a number. The reason for the change was to aid in analytics. With the number the statement name can vary with every run of the program. With a SQL text digest the statement name is stable.

I noticed that the logic in StatementName method is almost duplicated in the conn class here though not sure what the difference is

The difference is the code in conn is used when a statement is explicitly prepared, but no name is provided. The other path is for the statement cache.

Also I noticed that all the column names are returned when preparing the statement and not sure if that is something that can be skipped or what it is used for.

The Describe message also returns the data types of the columns. This is used to know how to encode parameters.


As I said above, I would be shocked if any of this made a significant performance difference, but you should be able to find out fairly easily. You can disable the automatic statement cache and manually prepare and call the statements you want. This would allow you to control whether statements are used and and what names are used. Do this on one of your services that has only a handful of SQL queries and see if it makes a difference.

jackc avatar Oct 24 '25 13:10 jackc

Thanks Jack for your feedback and work on this project.

To be honest, I would be shocked if the length of the name of the prepared statement had a measurable difference. As long as total message to PostgreSQL fits in a single packet I wouldn't expect it to make much real world difference even if the bytes over the wire is significantly different.

Sadly AWS charges for cross zone data transfer (and master db is typically in a single zone) so when doing millions of sql requests a day reducing total bytes transferred does have a financial saving.

https://medium.com/@debyroth340/the-hidden-cross-az-cost-how-we-reduced-aws-data-transfer-cost-by-80-836b6d06886d

FYI, in older versions of pgx cached statement names were generated by incrementing a number. The reason for the change was to aid in analytics. With the number the statement name can vary with every run of the program. With a SQL text digest the statement name is stable.

Do you have any more info on this? Interested to understand.

mchandler-plato avatar Oct 24 '25 14:10 mchandler-plato

To be honest, I would be shocked if the length of the name of the prepared statement had a measurable difference. As long as total message to PostgreSQL fits in a single packet I wouldn't expect it to make much real world difference even if the bytes over the wire is significantly different.

Sadly AWS charges for cross zone data transfer (and master db is typically in a single zone) so when doing millions of sql requests a day reducing total bytes transferred does have a financial saving.

Hmmm... 🫤

FYI, in older versions of pgx cached statement names were generated by incrementing a number. The reason for the change was to aid in analytics. With the number the statement name can vary with every run of the program. With a SQL text digest the statement name is stable.

Do you have any more info on this? Interested to understand.

If you have 2 queries, select * from foo and select * from baz then if the statement cache uses an incremented number to produce the prepared statement name then the order the connection executes those statements determines the name of the statement. Even in a single run of a program different connections will have different prepared statement names for the same statement. By using a digest of the query text, the same query will always have the same prepared statement name across connections, processes, and even versions of the program. This makes analytics easier if you only log the prepared statement name.

jackc avatar Oct 27 '25 22:10 jackc

By using a digest of the query text, the same query will always have the same prepared statement name across connections, processes, and even versions of the program. This makes analytics easier if you only log the prepared statement name.

Oh so this has zero impact on actual logic it just makes a stable name for logs. Got it. Thanks!

I might do a test if I have time with short names vs long names and see if there is a noticeable impact on data transfer and cross zone costs.

mchandler-plato avatar Oct 27 '25 22:10 mchandler-plato

@mchandler-plato Take a look at https://github.com/jackc/pgx/pull/2422. I suspect that would make a much bigger impact on network usage than short statement names. Not sure if I'm willing to accept the increased complexity, but the proof of concept reduced network traffic on small queries by more than half.

jackc avatar Nov 02 '25 00:11 jackc

Interesting PR, is describe needed for the internal workings of the driver?

Also I wonder if it would simplify things if there was a supported way to prepare a sql statement and bind it to a method (maybe with types?) so it could cache that info locally instead of globally.

i.e.

const mySql = `SELECT * FROM my_func($1);

// call is of type func(context.Context, int) (pgx.Rows, error)
call, err := pgx.Prepare[int](ctx, conn, mySql)

if err != nil { panic(err) }

rows, err := call(ctx, 1234);

mchandler-plato avatar Nov 03 '25 17:11 mchandler-plato

Interesting PR, is describe needed for the internal workings of the driver?

Yes. pgx uses the result column types as part of the scanning process.

Also I wonder if it would simplify things if there was a supported way to prepare a sql statement and bind it to a method (maybe with types?) so it could cache that info locally instead of globally.

We need the connection-wide prepared statement map as one of the behaviors of Prepare is it is idempotent. It is safe and has no meaningful performance cost to explicitly call Prepare before every call of a query method that uses it.

jackc avatar Nov 08 '25 23:11 jackc

Reducing bandwidth use is also of interest to us. Various 'serverless' PG providers are starting to charge for public/private network traffic, eg. Neon:

Image

So making sure PGX only sends what's strictly needed is a big plus!

analytically avatar Nov 14 '25 21:11 analytically