Setting CRS without going to GeoDataFrame?
I might be trying to eat my own tail here. If so, let me know!
I have a STAC API endpoint and am trying to generate GeoParquet summaries of all of the data in each collection. I am not able to set the CRS on the output geoparquet unless I go through GeoPandas. It does not seem that using GeoPandas is actually necessary and I would like to avoid the extra call if I can as this running daily in a lambda. How can I set the output CRS via stac-geoparquet?
Example hitting the backing OpenSearch instance directly:
size = 100
query = {
"query": {
"match_all": {}
},
"size":size
}
# Initial search
response = client.search(index=collection, body=query, scroll='1m')
scrollid = response.get('_scroll_id', '')
all_features = [i['_source'] for i in response['hits']['hits']]
# Parse to arrow and write to disk (or replace output with a buffer and stream to S3)
record_batch_reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(all_features)
stac_geoparquet.arrow.to_parquet(record_batch_reader, 'output.geoparquet')
The CRS on the resulting GeoParquet file is WGS84. Each geometry has a well-formatted proj:projjson CRS that is "IAU_2015:30100" (the Moon, lat/lon). Is it possible to set the CRS without going through geopandas?
Doesn't STAC mandate that the geometry field is always in WGS84?
Are you referring to the proj:geometry field?
@jlaura I think I'm following. stac-geoparquet is not setting the CRS of the proj:geometry column, which it could via proj:code. Reproducer:
$ uv pip install rio-stac
$ rio stac \
https://raw.githubusercontent.com/developmentseed/rio-stac/refs/heads/main/tests/fixtures/dataset_geo.tif \
> item.json
$ jq '.properties."proj:geometry"' item.json -c
{"type":"Polygon","coordinates":[[[373185.0,8019284.949381611],[639014.9492102272,8019284.949381611],[639014.9492102272,8286015.0],[373185.0,8286015.0],[373185.0,8019284.949381611]]]}
Quick conversion script:
import json
import stac_geoparquet.arrow
with open("item.json") as f:
item = json.load(f)
table = stac_geoparquet.arrow.parse_stac_items_to_arrow([item])
stac_geoparquet.arrow.to_parquet(table, "items.parquet")
Then, via gpq:
$ gpq describe items.parquet --metadata-only | jq '.columns."proj:geometry"'
{
"encoding": "WKB",
"geometry_types": [],
"crs": null
}
To Kyle's point, geometry will be WGS84. geoparquet doesn't have the concept of a "file-wide" CRS, though it does have a primary_column key in the metadata to indicate the default geometry column. @jlaura I could see two enhancements/fixes:
- Use
proj:codeor the otherprojCRS fields to set theproj:geometryCRS - Allow the user to specify the
primary_column
- Use
proj:codeor the otherprojCRS fields to set theproj:geometryCRS
This only works in the (rare?) case where all proj:geometry values in a dataset have the exact same CRS, which is why the current code doesn't even try to handle it.
Doesn't STAC mandate that the
geometryfield is always in WGS84?Are you referring to the
proj:geometryfield?
Yes, 100%. Which, is arguably wrong, because not all lat/lons are WGS84 and in the non-Earth case there is no transformation available. We simply ignore this, treat lat/lon as lat/lon, and then include the proj2.0 extension to help users. Ideally, we don't want to propagate that WGS84 hard coding out further, into the geoparquet files (see below re: the GeoParquet spec).
I think I'm following. stac-geoparquet is not setting the CRS of the
proj:geometrycolumn, which it could viaproj:code
I think that proj:code is workable, though dangerous. Doesn't that assume that the proj DB has the code in it? Perhaps proj:code or proj:wkt2 is safer? That also looks like it mirrors how Geopandas is handling this (specifically, CRS object that takes code or WKT string and then a per geometry / row WKB conversion). Based on the spec, maybe proj:projjson is best option because then this library is not having to attempt a WKT -> ProjJson or Code - > ProjJson conversion.
To Kyle's point,
geometrywill be WGS84. I think I am misunderstanding something out of the spec (and how GDAL/QGIS is working with the GeoParquet files). In the spec, I see that a column can have a projjson specified, non-WGS84 CRS, under the Column Metadata heading. Theprimary_columnthen specifies the primary geometry column. Isn't that, in effect, setting the CRS for the GeoParquet file? I fully understand that other geometry columns could have different CRS, but practically do not see a difference.
Thanks for iterating on this guys! The library is quite fun to play with.
Doesn't STAC mandate that the
geometryfield is always in WGS84? Are you referring to theproj:geometryfield?Yes, 100%. Which, is arguably wrong, because not all lat/lons are WGS84 and in the non-Earth case there is no transformation available. We simply ignore this, treat lat/lon as lat/lon, and then include the proj2.0 extension to help users. Ideally, we don't want to propagate that WGS84 hard coding out further, into the geoparquet files (see below re: the GeoParquet spec).
But the STAC spec defines it to be WGS84... I'm not sure STAC-GeoParquet should diverge from the STAC spec.
I think that
proj:codeis workable, though dangerous. Doesn't that assume that the proj DB has the code in it? Perhapsproj:codeorproj:wkt2is safer? That also looks like it mirrors how Geopandas is handling this (specifically, CRS object that takes code or WKT string and then a per geometry / row WKB conversion).
That GeoPandas code is iterating over columns, not rows. GeoParquet only allows one CRS per column.
That GeoPandas code is iterating over columns, not rows. GeoParquet only allows one CRS per column.
Understood. That is what I am trying to set, a column CRS for all of the geometries in the column.