stac-geoparquet

Items with heterogeneous Asset keys are parsed incorrectly

Open scottyhq opened this issue 1 year ago • 2 comments

import pystac_client # 0.8.5
import stac_geoparquet  # 0.6.0
import geopandas as gpd

client = pystac_client.Client.open(url='https://cmr.earthdata.nasa.gov/stac/NSIDC_ECS')

results = client.search(collections=['ATL03_006'],
                        bbox='-108.34, 38.823, -107.728, 39.19',
                        datetime='2023',
                        method='GET',
                        max_items=5,
)
items = results.item_collection()
record_batch_reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(items)
gf = gpd.GeoDataFrame.from_arrow(record_batch_reader)  
gf.assets.iloc[0]

The 'data' asset keys differ across these 5 items, and each item ends up with the other items' keys as well, with None as the value:

{'03/ATL03_20230103090928_02111806_006_02': {'href': 'https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.006/2023.01.03/ATL03_20230103090928_02111806_006_02.h5',
  'roles': array(['data'], dtype=object),
  'title': 'Direct Download'},
 '05/ATL03_20230205073720_07141806_006_02': None,
 '06/ATL03_20230206192127_07371802_006_02': None,
 '06/ATL03_20230306061322_11561806_006_02': None,
 '08/ATL03_20230108204519_02951802_006_02': None,
 'browse': {'href': 'https://n5eil01u.ecs.nsidc.org/DP0/BRWS/Browse.001/2024.04.08/ATL03_20230103090928_02111806_006_02_BRW.h5.images.tide_pole.jpg',
  'roles': array(['browse'], dtype=object),
  'title': 'Download ATL03_20230103090928_02111806_006_02_BRW.h5.images.tide_pole.jpg',
  'type': 'image/jpeg'},

These None entries prevent going back from a dataframe to pystac items:

import pystac
batch = stac_geoparquet.arrow.stac_table_to_items(gf.to_arrow())
items = [pystac.Item.from_dict(x) for x in batch]

File ~/GitHub/uw-cryo/coincident/.pixi/envs/dev/lib/python3.12/site-packages/pystac/asset.py:199, in Asset.from_dict(cls, d)
    193 """Constructs an Asset from a dict.
    194 
    195 Returns:
    196     Asset: The Asset deserialized from the JSON dict.
    197 """
    198 d = copy(d)
--> 199 href = d.pop("href")
    200 media_type = d.pop("type", None)
    201 title = d.pop("title", None)

AttributeError: 'NoneType' object has no attribute 'pop'

scottyhq • Nov 04 '24 09:11

In general this is a limitation of Parquet.

JSON has three states: a valid value, null, and a missing/undefined key. Because Parquet is columnar, the third state does not exist here. If a key exists for any one item, a column for that key is provisioned for every item, and items that lack the key get null.
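To make the point above concrete, here is a pure-Python sketch of what a columnar layout does to heterogeneous keys. It is only an illustration: `to_columns` and `to_rows` are hypothetical helpers, not part of Arrow, Parquet, or stac-geoparquet, and the asset names are made up.

```python
def to_columns(rows):
    """Pivot a list of dicts into one column per key, padding gaps with None."""
    # Union of keys across all rows, in first-seen order.
    keys = list(dict.fromkeys(k for row in rows for k in row))
    return {k: [row.get(k) for row in rows] for k in keys}


def to_rows(columns):
    """Pivot the columns back into one dict per row."""
    n_rows = len(next(iter(columns.values()), []))
    return [{k: col[i] for k, col in columns.items()} for i in range(n_rows)]


# Two items whose 'data' asset keys differ (made-up names).
assets = [
    {"granule_a": {"href": "a.h5"}, "browse": {"href": "a.jpg"}},
    {"granule_b": {"href": "b.h5"}, "browse": None},
]

round_tripped = to_rows(to_columns(assets))
for row in round_tripped:
    print(row)
```

After the round trip, every row carries every key. The second row's explicit `"browse": None` and its never-present `"granule_a"` both come back as None, so the missing-key state cannot be recovered from the columns alone.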

The default Arrow serialization emits None for null Arrow values. This was discussed on an earlier issue. Perhaps we could add a keyword parameter to stac_table_to_items to remove keys with None. But this would be difficult to do reliably, especially since None is required for some other keys: datetime, for example, has to be None when an item has start/end datetimes instead.

kylebarron • Nov 04 '24 13:11

Right, there is a similar issue for the case where an outlier Item in a collection is missing an asset common to the group, like 'thumbnail': https://github.com/stac-utils/stac-geoparquet/issues/77

> add a keyword parameter to stac_table_to_items to remove keys with None

Seems convenient. I wonder whether it would be too tricky to apply this only to the 'assets' column, to avoid complications with datetime?

My quick workaround currently is to just filter the assets column after loading to geopandas:

def filter_assets(assets):
    """Remove key: None pairs from an item's assets dict."""
    return {key: value for key, value in assets.items() if value is not None}

gf['assets'] = gf['assets'].apply(filter_assets)
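
The same filter could also be applied to the item dicts on the way back to pystac, touching only 'assets' so that a None datetime is left alone. This is a rough sketch, not an existing stac-geoparquet option: `drop_null_assets` is a hypothetical helper, and the example item is made up.

```python
def drop_null_assets(item_dict):
    """Return a shallow copy of a STAC item dict without None-valued assets.

    Only the top-level 'assets' mapping is filtered; 'properties' (including
    a None 'datetime') is passed through untouched.
    """
    cleaned = dict(item_dict)
    # Treat a missing or None 'assets' entry as an empty mapping.
    assets = cleaned.get("assets") or {}
    cleaned["assets"] = {k: v for k, v in assets.items() if v is not None}
    return cleaned


# Made-up item dict: None datetime is intentional, None asset is padding.
example = {
    "id": "item-1",
    "properties": {
        "datetime": None,
        "start_datetime": "2023-01-03T09:09:28Z",
        "end_datetime": "2023-01-03T09:15:00Z",
    },
    "assets": {
        "data": {"href": "https://example.com/item-1.h5"},
        "key-from-another-item": None,
    },
}

cleaned = drop_null_assets(example)
print(cleaned["assets"])
print(cleaned["properties"]["datetime"])
```

With the objects from the original report, this would presumably be used as `items = [pystac.Item.from_dict(drop_null_assets(x)) for x in batch]`, assuming stac_table_to_items yields plain dicts as the traceback above suggests. I have not run that against the NSIDC data.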

scottyhq • Nov 04 '24 14:11