geoparquet icon indicating copy to clipboard operation
geoparquet copied to clipboard

Bounding box column updates

Open kylebarron opened this issue 1 year ago • 3 comments

  • More clearly specify that the bounding box column may have 4 or 6 fields: e.g. that 3D bounding boxes are allowed
  • Strictly specify the ordering of the fields within the struct.
  • More clearly specify that if zmin is included, then zmax must also be included, and vice versa.

kylebarron avatar Apr 15 '24 20:04 kylebarron

@maxxen - is there any problems for DuckDB if we did this proposed strict specification of order in the struct?

cholmes avatar May 06 '24 19:05 cholmes

@cholmes I can't think of any reason why the ordering would matter for us right now, so feel free to go ahead!

( I also kind of like: xmin, xmax, ymin, ymax [zmin, zmax] as the optional axis won't be interleaved)

Maxxen avatar May 06 '24 19:05 Maxxen

( I also kind of like: xmin, xmax, ymin, ymax [zmin, zmax] as the optional axis won't be interleaved)

We had a discussion today about this on the call and various people supported that.

kylebarron avatar May 06 '24 20:05 kylebarron

The main annoyance of using xmin, xmax, ymin, ymax [, zmin, zmax] is that this is then inconsistent with the bbox column-level metadata, where we require an array of [<xmin>, <ymin>, <xmax>, <ymax>] or [<xmin>, <ymin>, <zmin>, <xmax>, <ymax>, <zmax>] ... (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#bbox)

It's of course a different kind of data structure, the latter is a json array, and what we are discussing here is just the order of fields in the schema. But it would still have been nice to use the same order. Although I definitely agree that xmin, xmax, ymin, ymax [, zmin, zmax] sounds easier to handle

jorisvandenbossche avatar May 13 '24 07:05 jorisvandenbossche

But it would still have been nice to use the same order.

+1

rouault avatar May 13 '24 12:05 rouault

Alright, well if thats already standardized then maybe it makes more sense to follow the existing convention.

Maxxen avatar May 13 '24 12:05 Maxxen

I believe the column metadata (bbox and geometry_types) follows GeoJSON, which makes sense as it is, in fact, JSON. I can see the argument for consistency, but also that the code iterating over both of those are at different levels of abstraction and I am not sure we risk mass confusion by choosing an option for the bounding box column that makes it easier to write scanners.

We would like to add a box/bbox/rect type to GeoArrow and we will probably use whatever order is decided on here to make sure our compute functions don't have to reorder anything...not that this spec is responsible for the choices in GeoArrow, but GeoArrow is a place where it would be nice to ensure that a 3D (or 4D) bounding box can share an implementation with something that only cares about 2D bounding boxes.

As above, though, I'm +1 on any option that makes the ordering explicit, since the workaround is not particularly verbose (scan_something(ArrowArray* array) vs. scan_something(ArrowArray* array, int* index_of_xy).

paleolimbot avatar May 13 '24 14:05 paleolimbot

Going to merge this one in, as it's got approvals and extensive discussion. Any further alternatives can be discussed as PR's to main.

cholmes avatar May 29 '24 03:05 cholmes