stac-geoparquet icon indicating copy to clipboard operation
stac-geoparquet copied to clipboard

Setting CRS without going to GeoDataFrame?

Open jlaura opened this issue 7 months ago • 6 comments

I might be trying to eat my own tail here. If so, let me know!

I have a STAC API endpoint and am trying to generate GeoParquet summaries of all of the data in each collection. I am not able to set the CRS on the output geoparquet unless I go through GeoPandas. It does not seem that using GeoPandas is actually necessary and I would like to avoid the extra call if I can as this running daily in a lambda. How can I set the output CRS via stac-geoparquet?

Example hitting the backing OpenSearch instance directly:

size = 100
query = {
    "query": {
        "match_all": {}
    },
    "size":size
}

# Initial search
response = client.search(index=collection, body=query, scroll='1m')
scrollid = response.get('_scroll_id', '')
all_features = [i['_source'] for i in response['hits']['hits']]

# Parse to arrow and write to disk (or replace output with a buffer and stream to S3)
record_batch_reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(all_features)
stac_geoparquet.arrow.to_parquet(record_batch_reader, 'output.geoparquet')

The CRS on the resulting GeoParquet file is WGS84. Each geometry has a well-formatted proj:projjson CRS that is "IAU_2015:30100" (the Moon, lat/lon). Is it possible to set the CRS without going through geopandas?

jlaura avatar May 21 '25 17:05 jlaura

Doesn't STAC mandate that the geometry field is always in WGS84?

Are you referring to the proj:geometry field?

kylebarron avatar May 21 '25 21:05 kylebarron

@jlaura I think I'm following. stac-geoparquet is not setting the CRS of the proj:geometry column, which it could via proj:code. Reproducer:

$ uv pip install rio-stac
$ rio stac \
    https://raw.githubusercontent.com/developmentseed/rio-stac/refs/heads/main/tests/fixtures/dataset_geo.tif \
    > item.json
$ jq '.properties."proj:geometry"' item.json -c
{"type":"Polygon","coordinates":[[[373185.0,8019284.949381611],[639014.9492102272,8019284.949381611],[639014.9492102272,8286015.0],[373185.0,8286015.0],[373185.0,8019284.949381611]]]}

Quick conversion script:

import json

import stac_geoparquet.arrow

with open("item.json") as f:
    item = json.load(f)

table = stac_geoparquet.arrow.parse_stac_items_to_arrow([item])
stac_geoparquet.arrow.to_parquet(table, "items.parquet")

Then, via gpq:

$ gpq describe items.parquet --metadata-only | jq '.columns."proj:geometry"'
{
  "encoding": "WKB",
  "geometry_types": [],
  "crs": null
}

To Kyle's point, geometry will be WGS84. geoparquet doesn't have the concept of a "file-wide" CRS, though it does have a primary_column key in the metadata to indicate the default geometry column. @jlaura I could see two enhancements/fixes:

  1. Use proj:code or the other proj CRS fields to set the proj:geometry CRS
  2. Allow the user to specify the primary_column

gadomski avatar May 22 '25 12:05 gadomski

  • Use proj:code or the other proj CRS fields to set the proj:geometry CRS

This only works in the (rare?) case where all proj:geometry values in a dataset have the exact same CRS, which is why the current code doesn't even try to handle it.

kylebarron avatar May 22 '25 14:05 kylebarron

Doesn't STAC mandate that the geometry field is always in WGS84?

Are you referring to the proj:geometry field?

Yes, 100%. Which, is arguably wrong, because not all lat/lons are WGS84 and in the non-Earth case there is no transformation available. We simply ignore this, treat lat/lon as lat/lon, and then include the proj2.0 extension to help users. Ideally, we don't want to propagate that WGS84 hard coding out further, into the geoparquet files (see below re: the GeoParquet spec).

I think I'm following. stac-geoparquet is not setting the CRS of the proj:geometry column, which it could via proj:code

I think that proj:code is workable, though dangerous. Doesn't that assume that the proj DB has the code in it? Perhaps proj:code or proj:wkt2 is safer? That also looks like it mirrors how Geopandas is handling this (specifically, CRS object that takes code or WKT string and then a per geometry / row WKB conversion). Based on the spec, maybe proj:projjson is best option because then this library is not having to attempt a WKT -> ProjJson or Code - > ProjJson conversion.

To Kyle's point, geometry will be WGS84. I think I am misunderstanding something out of the spec (and how GDAL/QGIS is working with the GeoParquet files). In the spec, I see that a column can have a projjson specified, non-WGS84 CRS, under the Column Metadata heading. The primary_column then specifies the primary geometry column. Isn't that, in effect, setting the CRS for the GeoParquet file? I fully understand that other geometry columns could have different CRS, but practically do not see a difference.

Thanks for iterating on this guys! The library is quite fun to play with.

jlaura avatar May 22 '25 16:05 jlaura

Doesn't STAC mandate that the geometry field is always in WGS84? Are you referring to the proj:geometry field?

Yes, 100%. Which, is arguably wrong, because not all lat/lons are WGS84 and in the non-Earth case there is no transformation available. We simply ignore this, treat lat/lon as lat/lon, and then include the proj2.0 extension to help users. Ideally, we don't want to propagate that WGS84 hard coding out further, into the geoparquet files (see below re: the GeoParquet spec).

But the STAC spec defines it to be WGS84... I'm not sure STAC-GeoParquet should diverge from the STAC spec.

I think that proj:code is workable, though dangerous. Doesn't that assume that the proj DB has the code in it? Perhaps proj:code or proj:wkt2 is safer? That also looks like it mirrors how Geopandas is handling this (specifically, CRS object that takes code or WKT string and then a per geometry / row WKB conversion).

That GeoPandas code is iterating over columns, not rows. GeoParquet only allows one CRS per column.

kylebarron avatar May 22 '25 16:05 kylebarron

That GeoPandas code is iterating over columns, not rows. GeoParquet only allows one CRS per column.

Understood. That is what I am trying to set, a column CRS for all of the geometries in the column.

jlaura avatar May 22 '25 16:05 jlaura