geos CoordinateSequence: create using external buffer

This PR adds methods to create a CoordinateSequence backed by an external buffer. Currently this can only be done for XYZ or XYZM sequences until GEOS algorithms are made "2d-safe."

Nov 30 '22 19:11 dbaston

Overall this change causes no performance regression, and a small gain in a few cases.

Nov 30 '22 20:11 dbaston

Nice! Would you consider an equivalent for x/y/... stored in seperate buffers? (Ie the qgis model)

Nov 30 '22 21:11 nyalldawson

Not really... the only reason this works is because the external buffer is exactly the same shape as the internal buffer.

Nov 30 '22 21:11 pramsey

Currently this can only be done for XYZ or XYZM sequences until GEOS algorithms are made "2d-safe."

What needs to be done to make things "2d-safe"?

Nov 30 '22 21:11 dr-jts

What needs to be done to make things "2d-safe"?

Need to stop reading 3D coordinates (Coordinate) by reference without checking whether the sequence is actually 3D or 4D. Not a small task!

Nov 30 '22 22:11 dbaston

Can we mostly start reading CoordinateXY by reference instead?

Nov 30 '22 22:11 pramsey

@pramsey

Not really... the only reason this works is because the external buffer is exactly the same shape as the internal buffer.

Fair enough -- this might be motivation enough for me to rework how QGIS internally stores these arrays!

Nov 30 '22 22:11 nyalldawson

Can we mostly start reading CoordinateXY by reference instead?

It's always safe to read CoordinateXY&, though the assumption that we have a Coordinate& shows up in a surprising number of places.

Simplification and triangulation are good examples...they're 2D algorithms, but they also have this assumption that we're going to preserve z values from the input.

Nov 30 '22 22:11 dbaston

Here's another test I did:

Removed the part of this PR that copies an external XY buffer to XYZ. This should give best case performance but could crash if the sequence is used in parts of the code that assume 3d coordinates.
Updated PostGIS so that GEOS coordinate sequences reference the memory in a POINTARRAY with no copies
Ran ST_IsValid on all polygons in 10 copies of the GADM country boundaries dataset (328 million vertices total), with STORAGE EXTERNAL

When geometries are copied into GEOS, this takes 100.58 seconds. When geometries are created with no coordinate copies, this takes 100.55 seconds. Not a big win for zero-copy!

A second test of a self-join with ST_Intersects takes 80 seconds when coordinates are copied and 74 seconds when the coordinates are reference. Maybe a 7.5% gain is worthwhile?

Feb 09 '23 15:02 dbaston

About a 10% gain on ST_Centroid. I have never been too worried about ST_Centroid as a bottleneck, but maybe better performance for very fast functions could still be useful for clients like GDAL that don't have all of the PostGIS-style short circuits. Calling GEOSIntersection on geometries that are likely to be disjoint, for example. (Although GEOSIntersection currently requires padded inputs)

To safely apply these changes, we'd need to have C API functions that require padded coordinates check their inputs and automatically copy into a new padded coordinate sequence. (This is easy to do with something like seq->reserve(seq->size()).)

Feb 09 '23 15:02 dbaston

As a PostGIS guy, this leaves me cold. The only win is for use cases where they are wrapping and using things like Area() and Length() and Distance() frequently, which... yeah it happens, but as your testing shows, for the "that's too hard to write myself, I need GEOS to do it" use cases, there's no win. I'd vote for GEOS simplicity and letting people write their own easy calculations. And honestly, of course the Rust guys are going to write Rust implementations of Area and Length and Distance! Purity! We saw the same thing with Turf.js and JSTS.

Feb 10 '23 17:02 pramsey