stac-geoparquet
Dictionary encode `collection` and `type`
Collection ID should always be a single value per Parquet table/dataset, so we should ensure we're dictionary-encoding it to save memory.
`type: "Feature"` could be removed from the table, but if we dictionary-encode it, it uses barely any memory.
Do we need to do anything explicitly here?
>>> import requests
>>> import stac_geoparquet.to_arrow
>>> import stac_geoparquet.from_arrow
>>> import stac_geoparquet.to_parquet
>>> import pyarrow.parquet
>>> items = requests.get("https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items").json()["features"]
>>> table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
>>> stac_geoparquet.to_parquet.to_parquet(table, "items.parquet")
and then
In [68]: pf = pyarrow.parquet.ParquetFile("items.parquet")
In [69]: rg = pf.metadata.row_group(0)
In [70]: [x for x in rg.to_dict()["columns"] if "collection" in x["path_in_schema"]]
Out[70]:
[{'file_offset': 87251,
'file_path': '',
'physical_type': 'BYTE_ARRAY',
'num_values': 10,
'path_in_schema': 'collection',
'is_stats_set': True,
'statistics': {'has_min_max': True,
'min': 'sentinel-2-l2a',
'max': 'sentinel-2-l2a',
'null_count': 0,
'distinct_count': None,
'num_values': 10,
'physical_type': 'BYTE_ARRAY'},
'compression': 'SNAPPY',
'encodings': ('PLAIN', 'RLE', 'RLE_DICTIONARY'),
'has_dictionary_page': True,
'dictionary_page_offset': 87153,
'data_page_offset': 87187,
'total_compressed_size': 98,
'total_uncompressed_size': 94}]
Does the presence of `RLE_DICTIONARY` in `encodings` mean we're good?
I was suggesting that we dictionary-encode it in memory, in the Arrow type; the Parquet writer will already try to dictionary-encode it in the file automatically.