stac-geoparquet Dictionary encode `collection` and `type`

Collection ID should always be a single value per Parquet table/dataset, so we should ensure we're dictionary-encoding it to save memory.

`type: "Feature"` could be removed from the table entirely, but if we dictionary-encode it, it uses almost no memory anyway.

Apr 23 '24 21:04 kylebarron

Do we need to do anything explicitly here?

>>> import requests
>>> import stac_geoparquet.to_arrow
>>> import stac_geoparquet.from_arrow
>>> import stac_geoparquet.to_parquet
>>> import pyarrow.parquet

>>> items = requests.get("https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items").json()["features"]
>>> table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
>>> stac_geoparquet.to_parquet.to_parquet(table, "items.parquet")

and then

In [68]: pf = pyarrow.parquet.ParquetFile("items.parquet")

In [69]: rg = pf.metadata.row_group(0)

In [70]: [x for x in rg.to_dict()["columns"] if "collection" in x["path_in_schema"]]
Out[70]:
[{'file_offset': 87251,
  'file_path': '',
  'physical_type': 'BYTE_ARRAY',
  'num_values': 10,
  'path_in_schema': 'collection',
  'is_stats_set': True,
  'statistics': {'has_min_max': True,
   'min': 'sentinel-2-l2a',
   'max': 'sentinel-2-l2a',
   'null_count': 0,
   'distinct_count': None,
   'num_values': 10,
   'physical_type': 'BYTE_ARRAY'},
  'compression': 'SNAPPY',
  'encodings': ('PLAIN', 'RLE', 'RLE_DICTIONARY'),
  'has_dictionary_page': True,
  'dictionary_page_offset': 87153,
  'data_page_offset': 87187,
  'total_compressed_size': 98,
  'total_uncompressed_size': 94}]

Does the presence of RLE_DICTIONARY in `encodings` mean we're good?

Apr 24 '24 14:04 TomAugspurger

I was suggesting that we dictionary-encode it in memory, in the Arrow type; the Parquet writer will automatically try to dictionary-encode it in the file.

Apr 24 '24 14:04 kylebarron