Read Arrow: Support for new options as of GDAL 3.8

Open kylebarron opened this issue 1 year ago • 7 comments

I was just re-reading the GetArrowStream API docs and saw that there are two new options as of GDAL 3.8:

  • TIMEZONE="unknown", "UTC", "(+|:)HH:MM" or any other value supported by Arrow. (GDAL >= 3.8)
  • GEOMETRY_METADATA_ENCODING=OGC/GEOARROW (GDAL >= 3.8).

It would be great to support these. The metadata encoding would be useful for my purposes. I don't use timezone information as often, but someone probably would like that.

kylebarron · Jan 24 '24 19:01

Looks like we'd need to add a keyword on ogr_open_arrow to set the geometry encoding, maybe geometry_encoding="wkb"|"geoarrow" (not finding "ogc" as a meaningful value here compared to "wkb" which is specific).

brendan-ward · Jan 24 '24 19:01

Oops, I mistakenly conflated two things: the actual geometry encoding, which looks like it is either WKB or native (which may be GeoArrow), with no way to specifically request GeoArrow geometry encoding; and the metadata that sets extension types (which I don't entirely follow here).

Does the geoarrow.wkb extension type mean that the actual geometry encoding is still (OGC) WKB, or is it serialized GeoArrow bytes? Apologies for not being more up to speed on GeoArrow!

brendan-ward · Jan 24 '24 19:01

> Looks like we'd need to add a keyword on ogr_open_arrow to set the geometry encoding, maybe geometry_encoding="wkb"|"geoarrow" (not finding "ogc" as a meaningful value here compared to "wkb" which is specific).

I agree this is confusing 🙂. The key is that the switch does not change the actual encoding of the geometry. The geometries are always WKB. Rather, the switch only changes the metadata applied to the Arrow geometry column. As it is, the geometry column is always loaded with an Arrow extension name of ogc.wkb, which is "non-standard" (if you call GeoArrow a standard 😅).

What this toggle does is change the extension name from ogc.wkb to geoarrow.wkb and attach the CRS as PROJJSON to the Arrow metadata. For example, right now read_arrow returns a tuple, (metadata, table) = read_arrow(), where the metadata object includes information about the CRS. When you set GEOMETRY_METADATA_ENCODING=GEOARROW, that metadata object is no longer needed because the CRS is stored in the table's internal metadata.
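As a rough sketch of what that means for a consumer: Arrow records extension types as field-level metadata under the keys ARROW:extension:name and ARROW:extension:metadata, and GeoArrow stores the CRS in the latter as JSON under a "crs" key. The helper name and the sample metadata below are made up for illustration, and the exact bytes GDAL emits may differ.

```python
import json

# Field-level metadata keys the Arrow format uses for extension types.
EXT_NAME = b"ARROW:extension:name"
EXT_META = b"ARROW:extension:metadata"


def crs_from_field_metadata(field_metadata):
    """Return the CRS stored on a geoarrow.wkb field, or None.

    `field_metadata` is a bytes -> bytes mapping, the shape pyarrow
    exposes as `field.metadata`. Hypothetical helper, not part of pyogrio.
    """
    if not field_metadata:
        return None
    if field_metadata.get(EXT_NAME) != b"geoarrow.wkb":
        # e.g. ogc.wkb: the CRS is not carried on the column itself
        return None
    raw = field_metadata.get(EXT_META)
    if not raw:
        return None
    return json.loads(raw).get("crs")


# Sample metadata invented for illustration (real PROJJSON is much larger).
ogc_field = {EXT_NAME: b"ogc.wkb"}
geoarrow_field = {
    EXT_NAME: b"geoarrow.wkb",
    EXT_META: json.dumps(
        {"crs": {"type": "GeographicCRS", "name": "WGS 84"}}
    ).encode(),
}

print(crs_from_field_metadata(ogc_field))
print(crs_from_field_metadata(geoarrow_field))
```

With the ogc.wkb field the helper finds nothing, so the CRS has to come from somewhere else (today, the separate metadata object); with the geoarrow.wkb field it can be read straight off the column.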

kylebarron · Jan 24 '24 20:01

Thanks for the explanation. It seems like this is therefore more of a boolean, opt-in option. metadata="ogc"|"geoarrow" doesn't seem great, though it somewhat parallels the options on the GDAL side.

What about geoarrow_metadata=False|True, where True means GEOMETRY_METADATA_ENCODING=GEOARROW and returns the stream with the additional information in the metadata? Presumably, if it is set, we don't need to separately resolve the CRS using other calls to GDAL.
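One way this mapping could look, as a minimal sketch: GetArrowStream takes its options as "KEY=VALUE" strings, so a boolean keyword can simply decide whether the GDAL option is added. The function name and the timezone keyword are hypothetical, not existing pyogrio API, and both options need GDAL >= 3.8.

```python
def arrow_stream_options(geoarrow_metadata=False, timezone=None):
    """Build "KEY=VALUE" option strings for GDAL's GetArrowStream.

    Hypothetical helper for illustration; the keyword names are
    assumptions based on the proposal in this thread.
    """
    options = []
    if geoarrow_metadata:
        # Opt in to geoarrow.wkb extension metadata (GDAL >= 3.8).
        options.append("GEOMETRY_METADATA_ENCODING=GEOARROW")
    if timezone is not None:
        # Override the driver's default timezone handling (GDAL >= 3.8).
        options.append(f"TIMEZONE={timezone}")
    return options


print(arrow_stream_options())
print(arrow_stream_options(geoarrow_metadata=True, timezone="UTC"))
```

Leaving the option out entirely when the flag is False keeps GDAL's own default in place, which also avoids passing an unknown option to older GDAL versions.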

brendan-ward · Jan 24 '24 20:01

For the GeoArrow metadata, I would personally consider making that the default here. (I understand that GDAL didn't do that for backwards compatibility, but given this is quite new, I would still make the change here.)

jorisvandenbossche · Jan 25 '24 08:01

The TIMEZONE option was added after an issue I opened (https://github.com/OSGeo/gdal/issues/8460), triggered by our testing here when adding timezone read/write support in pyogrio.

The problem is that GDAL's data model stores a fixed offset to UTC per individual value, and not an actual time "zone" (like "Europe/Brussels"), while the Arrow output needs a single timezone indicator for the full column (as it is part of the column's type). For some formats, Even added a specialized code path to preserve the timezone information. But for other formats, the best it can do is either preserve the wall time (unknown time zone) or convert to UTC.

My understanding is that the option lets you choose between those. The default is driver-dependent, so the option lets you override the driver's default. It's a bit complex to explain clearly.
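To make the two fallbacks concrete, here is a small illustration in plain Python (standard library only, not GDAL's actual code path): two values that each carry their own fixed offset, reduced either to UTC or to bare wall time so that one timezone setting can describe the whole column. The sample values are invented.

```python
from datetime import datetime, timedelta, timezone

# Two values as GDAL models them: each carries its own fixed UTC offset.
values = [
    datetime(2024, 1, 15, 12, 0, tzinfo=timezone(timedelta(hours=1))),  # +01:00
    datetime(2024, 7, 15, 12, 0, tzinfo=timezone(timedelta(hours=2))),  # +02:00
]

# Fallback 1: convert everything to UTC, so a single timezone fits the column.
as_utc = [v.astimezone(timezone.utc) for v in values]

# Fallback 2: keep the wall-clock time and drop the offset ("unknown" timezone).
as_wall_time = [v.replace(tzinfo=None) for v in values]

print([v.isoformat() for v in as_utc])
print([v.isoformat() for v in as_wall_time])
```

Converting to UTC keeps the instant but changes the displayed hour (11:00 and 10:00 here), while keeping the wall time preserves the displayed 12:00 but loses the information about which instant it was.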

jorisvandenbossche · Jan 25 '24 08:01

I don't work with timezones a lot, so in #366 I took the simpler route and only tried to add GeoArrow metadata handling.

kylebarron · Feb 26 '24 21:02