stac-geoparquet
Fixed base item merging logic for assets
In this snippet, there's a record with None for an asset:
```python
import adlfs  # noqa: F401 -- needed so fsspec can resolve the abfs:// filesystem
import dask_geopandas
import planetary_computer
import pystac

collection = pystac.read_file(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/aster-l1t"
)
asset = planetary_computer.sign(collection.assets["geoparquet-items"])

ddf = dask_geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
df = ddf.head()
df.assets.iloc[0]["qa-txt"]  # None
```
This `qa-txt` asset shows up on the base item in pgstac, but isn't on the actual item; it was incorrectly rehydrated.
Do you think this is the same issue?
```python
import geopandas
import pystac
import stac_geoparquet

URL = "https://www.planet.com/data/stac/disasters/hurricane-harvey/catalog.json"
catalog = pystac.read_file(URL)
dicts = [item.to_dict() for item in catalog.get_items(recursive=True)]

df = stac_geoparquet.to_geodataframe(dicts)
assert "full-jpg" not in df.loc[0].assets  # passes

df.to_parquet(f"{catalog.id}.parq")
df = geopandas.read_parquet(f"{catalog.id}.parq")
assert "full-jpg" not in df.loc[0].assets  # fails: "full-jpg" comes back with a None value
```
It feels like the serialization to parquet is adding `None` values to the asset dicts. Maybe the thing to do is to convert the assets to JSON strings before storing arbitrary JSON blobs?
```python
import json

import geopandas
import pystac
import stac_geoparquet

URL = "https://www.planet.com/data/stac/disasters/hurricane-harvey/catalog.json"
catalog = pystac.read_file(URL)
dicts = [item.to_dict() for item in catalog.get_items(recursive=True)]

df = stac_geoparquet.to_geodataframe(dicts)
df.assets = df.assets.apply(json.dumps)  # store assets as opaque JSON strings
assert "full-jpg" not in df.loc[0].assets

df.to_parquet(f"{catalog.id}.parq")
df = geopandas.read_parquet(f"{catalog.id}.parq")
df.assets = df.assets.apply(json.loads)  # decode back to dicts
assert "full-jpg" not in df.loc[0].assets  # now passes
```
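If the struct round-trip really is the cause, another workaround is to drop the `None`-valued entries after reading instead of changing how the column is stored. This is a hypothetical helper, not part of stac-geoparquet, and it assumes no item legitimately carries a `None` asset:

```python
def drop_null_assets(assets: dict) -> dict:
    """Remove asset keys whose value is None (artifacts of the struct round-trip)."""
    return {k: v for k, v in assets.items() if v is not None}

# Usage after reading the parquet file back:
#   df.assets = df.assets.apply(drop_null_assets)
print(drop_null_assets({"thumbnail": {"href": "a.png"}, "full-jpg": None}))
# → {'thumbnail': {'href': 'a.png'}}
```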