[FEA] Report the number of rows read per file in libcudf's Parquet reader

Open gaohao95 opened this issue 5 months ago • 2 comments

Is your feature request related to a problem? Please describe. I would like libcudf's Parquet reader to report the number of rows read from each file.

Consider the following example:

  std::vector<std::string> file_paths;  // defined elsewhere
  std::vector<std::string> column_names;  // defined elsewhere

  auto source  = cudf::io::source_info(file_paths);
  auto options = cudf::io::parquet_reader_options::builder(source);
  options.columns(column_names);
  auto result = cudf::io::read_parquet(options);

Here, result is of type table_with_metadata, but its metadata does not include the number of rows read from each file. I would like libcudf to add this functionality.

Describe the solution you'd like Report the number of rows read from each file in table_with_metadata.
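
As a rough illustration of the request, the sketch below shows one shape the result could take and how a caller might use it. Everything in it is hypothetical: `table_with_row_counts`, `rows_per_file`, and `file_index_for_row` are illustrative names, not part of libcudf. It also assumes that the output rows are concatenated in the same order as the input files.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the metadata this issue asks for; not a libcudf type.
struct table_with_row_counts {
  std::vector<std::int64_t> rows_per_file;  // one entry per input file, in input order
};

// Map a row index of the concatenated table back to the index of its source file.
// Returns rows_per_file.size() when the row index is out of range.
std::size_t file_index_for_row(table_with_row_counts const& result, std::int64_t row)
{
  if (row < 0) { return result.rows_per_file.size(); }

  // ends[i] is the exclusive end offset of file i in the concatenated table.
  std::vector<std::int64_t> ends(result.rows_per_file.size());
  std::partial_sum(result.rows_per_file.begin(), result.rows_per_file.end(), ends.begin());

  // The owning file is the first one whose end offset is greater than the row index.
  // Files that contributed zero rows are skipped automatically.
  auto const it = std::upper_bound(ends.begin(), ends.end(), row);
  return static_cast<std::size_t>(it - ends.begin());
}
```

With per-file counts in the result, a caller can recover which file each output row came from without a second pass over the file metadata.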

Describe alternatives you've considered I have tried calling cudf::io::read_parquet_metadata out of band, as in the following snippet:

  std::vector<cudf::size_type> rows_per_file;
  rows_per_file.reserve(file_paths.size());

  for (auto const& file_path : file_paths) {
    auto file_source = cudf::io::source_info(file_path);
    auto metadata    = cudf::io::read_parquet_metadata(file_source);
    rows_per_file.push_back(metadata.num_rows());
  }
  result.rows_per_file = std::move(rows_per_file);

However, this has nontrivial overhead in my use case. I believe the counts could be obtained for free as part of the Parquet read, since the reader has to decode the file footers anyway.
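
The idea can be sketched without libcudf: once the footers have been decoded, the per-file counts follow from the footer row counts and any global row selection. The following is a minimal sketch, not libcudf's implementation. `rows_read_per_file`, `footer_rows`, `skip_rows`, and `num_rows` are hypothetical names here, and the sketch assumes that row selection is applied to the files in input order, as if they formed one concatenated row range.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper, not a libcudf function.
// footer_rows[i] is the row count recorded in the footer of file i.
// skip_rows rows are skipped from the start of the concatenated input, and at most
// num_rows rows are read afterwards (no limit when num_rows is empty).
std::vector<std::int64_t> rows_read_per_file(std::vector<std::int64_t> const& footer_rows,
                                             std::int64_t skip_rows = 0,
                                             std::optional<std::int64_t> num_rows = std::nullopt)
{
  std::vector<std::int64_t> rows_read;
  rows_read.reserve(footer_rows.size());

  std::int64_t to_skip   = std::max<std::int64_t>(skip_rows, 0);
  bool const has_limit   = num_rows.has_value();
  std::int64_t remaining = has_limit ? std::max<std::int64_t>(*num_rows, 0) : 0;

  for (auto const rows_in_file : footer_rows) {
    // Consume the skip budget first; a file may be skipped entirely.
    auto const skipped = std::min(to_skip, rows_in_file);
    to_skip -= skipped;

    // Then cap what is left by the remaining row limit, if there is one.
    auto available = rows_in_file - skipped;
    if (has_limit) {
      available = std::min(available, remaining);
      remaining -= available;
    }
    rows_read.push_back(available);
  }
  return rows_read;
}
```

With no row selection, the result is just the footer row counts, which is why the reader should be able to report it without an extra metadata pass.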

gaohao95 • Mar 26 '24 06:03