[FEA] Profiling duplicate reading of metadata

Open calebwin opened this issue 3 years ago • 3 comments

The row-group-level filtered reading for Parquet introduced by #5843 causes metadata to be read twice when filters are specified (Parquet metadata is stored in each file's footer). Arrow reads the metadata and uses the user-provided filters to select a subset of the data to read [4]. Information about this subset is then passed to libcudf, which reads the subset [5]. As a result, the metadata is read twice: first when Arrow reads it to do the filtering, and again when libcudf reads the data.

This issue was initially raised here [1].

What to profile

  • [ ] Perf penalty of reading metadata using Arrow for filtering, in the same vein as [2] but with datasets containing varying numbers of files
  • [ ] Perf penalty of parsing the metadata buffer [3], as a fraction of the total time Arrow spends reading metadata
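
A minimal timing sketch for the two measurements above, using only the Python standard library. `read_metadata` and `parse_metadata` are hypothetical placeholders for whichever Arrow calls end up being profiled; they are not real pyarrow or cudf APIs, and the harness makes no claim about how [2] was measured.

```python
import statistics
import time


def median_seconds(fn, *args, repeats=5):
    """Return the median wall-clock time of fn(*args) over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def parse_fraction(read_metadata, parse_metadata, paths, repeats=5):
    """Estimate the fraction of metadata-read time that is spent parsing.

    read_metadata(path)  -- hypothetical: full metadata read (I/O + parse)
    parse_metadata(path) -- hypothetical: only the buffer-parsing step
    """
    total = sum(median_seconds(read_metadata, p, repeats=repeats) for p in paths)
    parse = sum(median_seconds(parse_metadata, p, repeats=repeats) for p in paths)
    return parse / total if total > 0 else 0.0
```

To cover datasets with varying numbers of files, the same call could be repeated over slices such as `paths[:1]`, `paths[:10]`, and `paths[:100]`.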

What to determine

  • [ ] Determine whether the perf penalty of the additional metadata read using Arrow is significant
  • [ ] Determine whether the duplicate read should be resolved by passing from Arrow Dataset to the libcudf reader functions either the metadata struct (steps to implement [6]) or the metadata buffer, which libcudf would then parse into a metadata struct (steps to implement [7])
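
To make the "metadata buffer" option concrete, here is a rough sketch that reads the raw footer bytes once and hands the same buffer to both stages. It relies only on the Parquet file layout (file metadata, then a 4-byte little-endian length, then the `PAR1` magic). `select_row_groups` and `read_row_groups` are hypothetical stand-ins for the Arrow-side filtering and the libcudf-side read; neither is an existing API, and the steps in [7] may differ.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def read_footer_buffer(path):
    """Return the raw (still Thrift-encoded) file metadata bytes.

    A Parquet file ends with:
    <file metadata> <4-byte little-endian metadata length> b"PAR1"
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)  # last 8 bytes: metadata length + magic
        tail = f.read(8)
        if tail[4:] != PARQUET_MAGIC:
            raise ValueError("missing PAR1 trailer")
        (footer_len,) = struct.unpack("<I", tail[:4])
        f.seek(-(8 + footer_len), 2)
        return f.read(footer_len)


def filter_then_read(path, select_row_groups, read_row_groups):
    """Read the footer once and share the buffer with both stages."""
    footer = read_footer_buffer(path)        # the only footer read
    row_groups = select_row_groups(footer)   # hypothetical: Arrow-side filtering
    return read_row_groups(path, row_groups, footer)  # hypothetical: libcudf-side read
```

With the "metadata struct" option, the parsed struct would be passed in place of `footer`, so that the second stage skips the parse as well as the I/O.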

Relevant discussion/code

[1] https://github.com/rapidsai/cudf/pull/5843#discussion_r467191621
[2] https://github.com/rapidsai/cudf/pull/5843#issuecomment-673566456
[3] https://github.com/apache/arrow/blob/2e6009621011d7df43882aa883905b84d1647018/cpp/src/parquet/file_reader.cc#L532
[4] https://github.com/rapidsai/cudf/pull/5843/files#diff-deac873508aaa12ca2e7c0a2c9035230R316
[5] https://github.com/rapidsai/cudf/pull/5843/files#diff-deac873508aaa12ca2e7c0a2c9035230R359-R375
[6] https://github.com/rapidsai/cudf/pull/5843#issuecomment-674437264
[7] https://github.com/rapidsai/cudf/pull/5843#issuecomment-674437709

calebwin, Aug 17 '20 19:08

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot], Feb 16 '21 21:02

Hello @wence-, now that #15028 is merged, would you please let me know whether cuDF-python is still reading Parquet row group metadata using pyarrow? Or has that step been completely removed?

GregoryKimball, Feb 16 '24 23:02

That change just exposed the libcudf functionality; we haven't migrated to using it from cudf-python (partly due to #15051).

wence-, Feb 17 '24 10:02