
[FEA] parquet gpuComputePageSizes takes a very long time to compute

Open revans2 opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe.

I am not an expert on decoding parquet data, so this might just be a question, or perhaps it can trigger some help in finding a way to improve performance.

We have a customer with some very deeply nested data types stored in parquet and compressed with zstd. For their three longest-running queries, I see gpuComputePageSizes take about as much time as gpuDecodePageData, and significantly more than the zstd decompression. I also see that gpuComputePageSizes runs twice as many kernels as gpuDecodePageData.

Looking at the code, I see a few comments about possible optimizations in gpuComputePageSizes, but I also see that there are two passes: one to compute the "unlimited size" and a second trim pass to compute the final size, taking a row range into account.

So I have two questions for you.

  1. Is there anything we can do to speed up gpuComputePageSizes itself? I am happy to provide example data as needed.
  2. For Spark we don't set _skip_rows or _num_rows as we generally always want to read the entire file. Is there a way to avoid making that second call to gpuComputePageSizes?
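
To make the second question concrete, here is a rough host-side sketch in Python of the idea: run the "unlimited size" pass always, and run the trim pass only when a row range actually restricts the read. This is not cudf code. The `Page` type, the `compute_page_sizes` function, and the fixed `bytes_per_row` model are all hypothetical simplifications; real nested pages need repetition/definition level decoding to get sizes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a parquet page (not a cudf type)."""
    first_row: int      # index of the page's first row within the file
    num_rows: int       # number of rows stored in the page
    bytes_per_row: int  # simplification: assume fixed-width rows


def compute_page_sizes(
    pages: List[Page],
    skip_rows: int = 0,
    num_rows: Optional[int] = None,
) -> Tuple[List[int], int]:
    """Return (output size per page, number of passes run)."""
    # Pass 1: "unlimited" size, ignoring any row range.
    sizes = [p.num_rows * p.bytes_per_row for p in pages]
    passes = 1

    total_rows = sum(p.num_rows for p in pages)
    reads_whole_file = skip_rows == 0 and (
        num_rows is None or num_rows >= total_rows
    )
    if reads_whole_file:
        # No row bounds in effect: the trim pass would change nothing.
        return sizes, passes

    # Pass 2: trim each page to the requested row range.
    end_row = total_rows if num_rows is None else min(total_rows, skip_rows + num_rows)
    trimmed = []
    for p in pages:
        lo = max(p.first_row, skip_rows)
        hi = min(p.first_row + p.num_rows, end_row)
        trimmed.append(max(0, hi - lo) * p.bytes_per_row)
    return trimmed, passes + 1
```

With two 10-row pages, calling `compute_page_sizes(pages)` runs one pass, while `compute_page_sizes(pages, skip_rows=5, num_rows=10)` runs both. Whether the real kernels can be skipped this way depends on what else the second pass computes, which this sketch does not model.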

revans2 avatar Jun 30 '22 15:06 revans2

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Jul 30 '22 19:07 github-actions[bot]

@nvdbaranec is this done?

revans2 avatar Aug 03 '22 14:08 revans2

Followups are still in flight.

nvdbaranec avatar Aug 03 '22 15:08 nvdbaranec

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Sep 02 '22 16:09 github-actions[bot]