parquet-testing

Add test file with sorting columns

Open mapleFU opened this issue 1 year ago • 5 comments

Generation script (with pyarrow 16.1.0):

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> sorting_columns = (pq.SortingColumn(column_index=0, descending=True, nulls_first=True), pq.SortingColumn(column_index=1, descending=False))
>>> table = pa.table({'a': [None, 2, 1], 'b': ['a', 'b', 'c']})
>>> with pq.ParquetWriter('sorting_columns.parquet', table.schema, sorting_columns=sorting_columns) as writer:
...     for i in range(2):
...         writer.write_table(table)
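
Note that, per the pyarrow documentation, the writer neither sorts the data nor verifies that it is sorted; `sorting_columns` is only recorded in the row-group metadata. So it is worth checking that the test data really follows the declared order. A minimal pure-Python sketch of that check follows. The `(column_index, descending, nulls_first)` tuples and the helper names are hypothetical stand-ins for illustration, not pyarrow API:

```python
# Hypothetical stand-in for pq.SortingColumn: (column_index, descending, nulls_first)
SORTING_COLUMNS = [(0, True, True), (1, False, False)]

# Rows of the table above, one tuple per row: (a, b)
ROWS = [(None, "a"), (2, "b"), (1, "c")]


def compare_values(x, y, descending, nulls_first):
    """Return -1, 0, or 1 for one column under the declared ordering."""
    if x is None or y is None:
        if x is None and y is None:
            return 0
        # Null placement is controlled by nulls_first, independent of direction.
        x_goes_first = (x is None) == nulls_first
        return -1 if x_goes_first else 1
    if x == y:
        return 0
    ascending_result = -1 if x < y else 1
    return -ascending_result if descending else ascending_result


def compare_rows(r1, r2, sorting_columns=SORTING_COLUMNS):
    """Compare two rows column by column; later columns only break ties."""
    for column_index, descending, nulls_first in sorting_columns:
        result = compare_values(
            r1[column_index], r2[column_index], descending, nulls_first
        )
        if result != 0:
            return result
    return 0


def is_sorted(rows, sorting_columns=SORTING_COLUMNS):
    """True if every adjacent pair of rows respects the declared ordering."""
    return all(
        compare_rows(a, b, sorting_columns) <= 0 for a, b in zip(rows, rows[1:])
    )


print(is_sorted(ROWS))
```

For the three rows written to each row group (`None`, `2`, `1` in column `a`), the check passes: the null comes first and the remaining values are descending.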

File Metadata:

{
  "Version": "2.6",
  "CreatedBy": "parquet-cpp-arrow version 16.1.0",
  "TotalRows": "6",
  "NumberOfRowGroups": "2",
  "NumberOfRealColumns": "2",
  "NumberOfColumns": "2",
  "Columns": [
     { "Id": "0", "Name": "a", "PhysicalType": "INT64", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
     { "Id": "1", "Name": "b", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} }
  ],
  "RowGroups": [
     {
       "Id": "0",  "TotalBytes": "166",  "TotalCompressedBytes": "174",  "SortColumns": [{"column_idx":0, "descending":1, "nulls_first": 1}, {"column_idx":1, "descending":0, "nulls_first": 0}],  "Rows": "3",
       "ColumnChunks": [
          {"Id": "0", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "1", "Max": "2", "Min": "1" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "100", "CompressedSize": "104" },
          {"Id": "1", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "c", "Min": "a" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "66", "CompressedSize": "70" }
        ]
     },
     {
       "Id": "1",  "TotalBytes": "166",  "TotalCompressedBytes": "174",  "SortColumns": [{"column_idx":0, "descending":1, "nulls_first": 1}, {"column_idx":1, "descending":0, "nulls_first": 0}],  "Rows": "3",
       "ColumnChunks": [
          {"Id": "0", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "1", "Max": "2", "Min": "1" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "100", "CompressedSize": "104" },
          {"Id": "1", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "c", "Min": "a" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "66", "CompressedSize": "70" }
        ]
     }
  ]
}
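
Since both row groups are expected to carry the same sort properties, a quick sanity check on a dump like the one above could look like the sketch below. It uses only the standard `json` module; the embedded string is a trimmed copy of the dump (only `Id` and `SortColumns` are kept), and the function names are made up for illustration:

```python
import json

# Trimmed copy of the metadata dump above: only "Id" and "SortColumns" are kept.
METADATA = """
{
  "RowGroups": [
    {"Id": "0",
     "SortColumns": [{"column_idx": 0, "descending": 1, "nulls_first": 1},
                     {"column_idx": 1, "descending": 0, "nulls_first": 0}]},
    {"Id": "1",
     "SortColumns": [{"column_idx": 0, "descending": 1, "nulls_first": 1},
                     {"column_idx": 1, "descending": 0, "nulls_first": 0}]}
  ]
}
"""


def sort_columns_by_row_group(metadata_json):
    """Map each row group's Id to its list of SortColumns entries."""
    doc = json.loads(metadata_json)
    return {rg["Id"]: rg["SortColumns"] for rg in doc["RowGroups"]}


def all_row_groups_agree(metadata_json):
    """True if every row group declares exactly the same SortColumns."""
    per_group = list(sort_columns_by_row_group(metadata_json).values())
    return all(cols == per_group[0] for cols in per_group)


print(all_row_groups_agree(METADATA))
```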

mapleFU, Aug 07 '24 07:08

cc @wgtmac @pitrou

mapleFU, Aug 07 '24 07:08

It might be useful to have multiple row groups.

wgtmac, Aug 07 '24 16:08

> It might be useful to have multiple row groups.

Currently, we're not able to create a file with different sorting properties in different row groups. Should I implement that first, or just use the same sorting properties in both row groups?

mapleFU, Aug 08 '24 04:08

I think it is fine to use the same properties across all row groups.

wgtmac, Aug 08 '24 06:08

@wgtmac @pitrou

I've changed it to use 2 row groups with the same data. The PR description has been updated accordingly.

mapleFU, Aug 09 '24 06:08

I'll merge this first. Feel free to edit or overwrite it if there is any issue with the file or the description.

mapleFU, Aug 13 '24 16:08