
[FEA] Add more functionality to `cudf.io.read_parquet_metadata` API

Open galipremsagar opened this issue 3 years ago • 6 comments

Is your feature request related to a problem? Please describe. It would be nicer to have row-group-wise metadata returned instead of just the number of row groups. That way users can identify how many rows & columns are stored in each row group.

galipremsagar avatar Jul 07 '22 15:07 galipremsagar

@vuule Did you intend to follow the same pyarrow parquet schema? Like:

(Pdb) x = pq.ParquetFile(fname)
(Pdb) x.metadata
<pyarrow._parquet.FileMetaData object at 0x7fbe0a172040>
  created_by: parquet-cpp-arrow version 8.0.0
  num_columns: 15
  num_rows: 0
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 7481

(Pdb) x.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7fbe0a172270>
  num_columns: 15
  num_rows: 0
  total_byte_size: 196

or do we want to extract only the necessary bit of details and return those?

galipremsagar avatar Jul 07 '22 15:07 galipremsagar

I like the option to follow the pyarrow metadata structure here, if it's not a huge overhead to gather.

vuule avatar Jul 07 '22 15:07 vuule

Yea, should not be an issue since we already tap into this API anyways: https://github.com/rapidsai/cudf/blob/branch-22.08/python/cudf/cudf/io/parquet.py#L199

galipremsagar avatar Jul 07 '22 15:07 galipremsagar

cc: @rjzamora for visibility

galipremsagar avatar Jul 07 '22 15:07 galipremsagar

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Aug 06 '22 16:08 github-actions[bot]

> That way users can identify how many rows & columns are stored in each row-group.

Now that we have read_parquet_metadata from #13663, could we re-scope this issue to specify the changes we would like to see in the parquet_metadata class?

As I understand it, the number of columns will be identical for all row groups. We could add row counts for each row group as a new vector, or perhaps add them to metadata. Are the row group min/max statistics stored as key-value pairs in parquet_metadata.metadata?

GregoryKimball avatar Feb 16 '24 22:02 GregoryKimball