[FEA] Add more functionality to `cudf.io.read_parquet_metadata` API
Is your feature request related to a problem? Please describe. It would be nice to have per-row-group metadata returned instead of just the number of row groups. That way users can identify how many rows & columns are stored in each row-group.
@vuule Did you intend to follow the same structure as the pyarrow parquet metadata? Like:
(Pdb) x = pq.ParquetFile(fname)
(Pdb) x.metadata
<pyarrow._parquet.FileMetaData object at 0x7fbe0a172040>
created_by: parquet-cpp-arrow version 8.0.0
num_columns: 15
num_rows: 0
num_row_groups: 1
format_version: 1.0
serialized_size: 7481
(Pdb) x.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7fbe0a172270>
num_columns: 15
num_rows: 0
total_byte_size: 196
Or do we want to extract only the necessary details and return those?
I like the option of following the pyarrow metadata structure here, if it's not a huge overhead to gather.
Yeah, that should not be an issue since we already tap into this API anyway: https://github.com/rapidsai/cudf/blob/branch-22.08/python/cudf/cudf/io/parquet.py#L199
cc: @rjzamora for visibility
> That way users can identify how many rows & columns are stored in each row-group.
Now that we have read_parquet_metadata from #13663, could we re-scope this issue to specify the changes we would like to see in the parquet_metadata class?
As I understand, the number of columns will be identical for all row groups. We could add row counts for each row group as a new vector, or perhaps to metadata. Are the row group min/max statistics stored as key-value pairs in parquet_metadata.metadata?