gdal icon indicating copy to clipboard operation
gdal copied to clipboard

Use COPY BINARY for OGR PostgreSQL / PostGIS driver

Open bjornharrtell opened this issue 1 year ago • 3 comments

I noticed that the OGR PostgreSQL / PostGIS driver is using COPY in text mode and geometry is transfered by encoding it to hex EWKB representation.

I believe COPY BINARY mode with native PostGIS binary geometry representation would perform significantly better.

The drawback is possibly that native PostGIS binary geometry representation is an implementation detail of PostGIS. However, I would regard it as stable due to that it's not something PostGIS can break in a major version. AFAIK they did not make any changes to it between PostGIS 2 and 3 either.

Another similar detail is probably jsonb which internal representation in PostgreSQL is a bit of a secret. Not sure how/if OGR supports jsonb anyway though.

COPY BINARY is of course also significantly more efficient for non-spatial columns.

bjornharrtell avatar Sep 15 '22 20:09 bjornharrtell

Many years ago, I disabled in 1d0d64b67ebe1fec2e185718c2da123c5c3dba90 the use of binary cursors that was available through the "PGB:dbname=...." connection string with this commit message "PG: [...] disable code that was intended to handled binary cursors as it cannot be triggered". Unfortunately, I can't remember why the code couldn't be triggered (before this change). Anyway it wasn't obvious if binary cursors were really faster, or perhaps there were issues to support some data types, as you mention this relies on undocumented/internal Postgres representation. I'd expect COPY BINARY to have the similar challenges. Any benchmark on how much COPY BINARY would be more efficient before going into the hassle of coding it ? That would imply that PostgreSQL can serialize on disk faster than the network connection / Unix socket, which might be true on slow connections, but not when GDAL and the server are on the same host, right ? Doesn't sslcompression (https://stackoverflow.com/questions/7218571/postgresql-tcp-data-compression) do a right job at minimizing data that transits between client and server ?

rouault avatar Sep 15 '22 20:09 rouault

Agree, before going ahead with this some quantative benchmarks would be in order. I'm actually not thinking about optimizing I/O or size here, it's the CPU required to encode everything to text and having PostgreSQL decode it again can be a significant bottleneck especially under optimal I/O conditions.

I've been happily using COPY BINARY in some custom Java and C# migration tools for large volumes of data but I admit I never compared performance with COPY TEXT.

bjornharrtell avatar Sep 15 '22 21:09 bjornharrtell

Some quick tests for local PostgreSQL not involving any external code. I had a 1.9 GB table with assorted columns including geometry and rountripped it with COPY vs COPY BINARY:

COPY

  • Getting data out took 21 seconds resulting in a 2.8 GB file
  • Getting data in took 65 seconds

COPY BINARY

  • Getting data out took 10 seconds resulting in a 1.5 GB file
  • Getting data in took 33 seconds

This was on local postgresql and CPU appears to be the bottleneck in that case. So we're looking at 2x speedup when not constrained by I/O. I suppose theoretically similar gains is to be had in GDAL if switching from text encode to the raw binary form, so perhaps 4x speedup.

bjornharrtell avatar Sep 16 '22 07:09 bjornharrtell

That sounds like an interesting idea of optimization. It should probably only be used after verifying that the columns of the table are of a known type for which we have verified the binary encoding (and if not, fallback to the existing text-based interface) A logic similar to https://github.com/OSGeo/gdal/blob/eb25a2f3e9f278b3d043b9fe1d87d6a3fbc8c23d/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp#L773 , extended to other data types, could also be applied to check that the binary format returned by the server matches our expectations.

rouault avatar Nov 22 '22 17:11 rouault