stac-geoparquet
Fixed base item merging logic for assets
In this snippet, there's a record with None for an asset:
```python
import adlfs  # noqa: F401 -- needed so fsspec can resolve the abfs:// filesystem
import dask_geopandas
import planetary_computer
import pystac

collection = pystac.read_file(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/aster-l1t"
)
asset = planetary_computer.sign(collection.assets["geoparquet-items"])

ddf = dask_geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
df = ddf.head()
df.assets.iloc[0]["qa-txt"]  # None
```
This `qa-txt` asset shows up on the base item in pgstac, but isn't on the actual item; it was incorrectly rehydrated.
Do you think this is the same issue?
```python
import geopandas
import pystac
import stac_geoparquet

URL = "https://www.planet.com/data/stac/disasters/hurricane-harvey/catalog.json"
catalog = pystac.read_file(URL)
dicts = [item.to_dict() for item in catalog.get_items(recursive=True)]

df = stac_geoparquet.to_geodataframe(dicts)
assert "full-jpg" not in df.loc[0].assets  # passes

df.to_parquet(f"{catalog.id}.parq")
df = geopandas.read_parquet(f"{catalog.id}.parq")
assert "full-jpg" not in df.loc[0].assets  # fails: "full-jpg" comes back with a None value
```
It feels like the serialization to parquet is adding `None` values to the asset dicts. Maybe the thing to do is to convert the assets to JSON strings before storing arbitrary JSON blobs?
```python
import json

import geopandas
import pystac
import stac_geoparquet

URL = "https://www.planet.com/data/stac/disasters/hurricane-harvey/catalog.json"
catalog = pystac.read_file(URL)
dicts = [item.to_dict() for item in catalog.get_items(recursive=True)]

df = stac_geoparquet.to_geodataframe(dicts)
df.assets = df.assets.apply(json.dumps)  # store assets as opaque JSON strings
assert "full-jpg" not in df.loc[0].assets

df.to_parquet(f"{catalog.id}.parq")
df = geopandas.read_parquet(f"{catalog.id}.parq")
df.assets = df.assets.apply(json.loads)  # decode back to dicts
assert "full-jpg" not in df.loc[0].assets  # now passes
```
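If the struct round-trip really is the cause, another workaround is to drop the `None`-valued entries after reading instead of changing how the column is stored. This is a hypothetical helper, not part of stac-geoparquet, and it assumes no item legitimately carries a `None` asset:

```python
def drop_null_assets(assets: dict) -> dict:
    """Remove asset keys whose value is None (artifacts of the struct round-trip)."""
    return {k: v for k, v in assets.items() if v is not None}

# Usage after reading the parquet file back:
#   df.assets = df.assets.apply(drop_null_assets)
print(drop_null_assets({"thumbnail": {"href": "a.png"}, "full-jpg": None}))
# → {'thumbnail': {'href': 'a.png'}}
```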