
[FEA] Add python bindings in the parquet reader for `num_rows`/`skiprows`

Open • GregoryKimball opened this issue 1 year ago • 1 comment

Is your feature request related to a problem? Please describe.

Unfortunately, there has been churn in libcudf around support for num_rows/skiprows in the Parquet and ORC readers. In 22.08 we deprecated these parameters in the Parquet reader (#11218), and then in 22.10 we removed them from C++ (#11503) and Python (#11480). We also deprecated num_rows/skiprows in the ORC reader (#11522, see issue #11519).

At this point, we realized that chunked parquet reading (#11867) would require adding num_rows/skiprows back to the C++ implementation (#11657).

Let's stabilize row selection APIs in libcudf by completing these tasks:

  • [ ] Add python bindings in the parquet reader for num_rows/skiprows
  • [ ] Remove the deprecation notice in the ORC reader for num_rows/skiprows (#11522)
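
For readers unfamiliar with these parameters, here is a minimal sketch of the row selection semantics under discussion: `skiprows` drops rows from the start of the file, and `num_rows` caps how many rows are returned after that. The helper `select_rows` and its `row_group_sizes` argument are hypothetical names used only for illustration. This is not libcudf's implementation, just one way such a request might be mapped onto Parquet row groups.

```python
def select_rows(row_group_sizes, skiprows=0, num_rows=None):
    """Map a (skiprows, num_rows) request onto per-row-group slices.

    Hypothetical helper for illustration only; not part of cuDF or libcudf.
    Returns a list of (row_group_index, start, stop) tuples, where start and
    stop are row offsets relative to the beginning of that row group.
    """
    if skiprows < 0 or (num_rows is not None and num_rows < 0):
        raise ValueError("skiprows and num_rows must be non-negative")

    total = sum(row_group_sizes)
    # Global [first, last) row range requested, clamped to the file size.
    first = min(skiprows, total)
    last = total if num_rows is None else min(first + num_rows, total)

    slices = []
    group_start = 0
    for index, size in enumerate(row_group_sizes):
        group_stop = group_start + size
        # Intersect the requested range with this row group's range.
        lo = max(first, group_start)
        hi = min(last, group_stop)
        if lo < hi:
            slices.append((index, lo - group_start, hi - group_start))
        group_start = group_stop
    return slices


# Three row groups of 100 rows each; skip 150 rows, then read 100 rows.
print(select_rows([100, 100, 100], skiprows=150, num_rows=100))
```

A chunked reader could, in principle, call a helper like this repeatedly with an advancing `skiprows` and a fixed `num_rows` to walk a file in bounded pieces, which is why the two parameters matter for chunked reading.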

Additional context

We also dropped num_rows/skiprows support in the cuDF-python fuzz tests (#11505). My preference is to not include any Python fuzz testing changes in the scope of this issue.

GregoryKimball • Feb 26 '24 19:02

Planning on implementing this as part of porting the Parquet reader to pylibcudf.

lithomas1 • Jun 17 '24 22:06