
Fix chunked reads of Parquet delta encoded pages

etseidl opened this pull request 1 year ago • 7 comments

Description

The chunked Parquet reader currently does not properly estimate the sizes of delta-encoded string pages. This PR modifies gpuDecodeTotalPageStringSize() to account for the new encodings.
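
As a rough sketch of the intent (not the actual kernel code; the enum values and struct fields below are simplified stand-ins for illustration only), the per-page accounting now branches on the encoding:

```cpp
#include <cstddef>

// Illustrative only: these types are stand-ins, not the real cudf structures.
enum class Encoding { PLAIN, PLAIN_DICTIONARY, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY };

struct PageInfo {
  Encoding encoding;
  std::size_t uncompressed_page_size;    // total bytes in the decompressed page data
  std::size_t encoded_lengths_size;      // bytes occupied by the delta-packed length block
  std::size_t prefix_suffix_length_sum;  // sum of decoded prefix + suffix lengths
};

// Estimate the total number of string bytes the page produces when decoded.
std::size_t total_page_string_size(PageInfo const& p)
{
  switch (p.encoding) {
    case Encoding::DELTA_LENGTH_BYTE_ARRAY:
      // Lengths are packed up front; everything after them is raw string data.
      return p.uncompressed_page_size - p.encoded_lengths_size;
    case Encoding::DELTA_BYTE_ARRAY:
      // Prefix-compressed strings: the decoded size is the sum of prefix and suffix lengths.
      return p.prefix_suffix_length_sum;
    default:
      // PLAIN and dictionary pages are handled by the pre-existing logic; placeholder here.
      return p.uncompressed_page_size;
  }
}
```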

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

etseidl avatar Jan 29 '24 18:01 etseidl

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Jan 29 '24 18:01 copy-pr-bot[bot]

@nvdbaranec I'd like your opinion on how to handle the sizes. Do you want me to decode the delta lengths to get exact sizes or do you think the estimates are good enough?

etseidl avatar Jan 29 '24 18:01 etseidl

@nvdbaranec I'd like your opinion on how to handle the sizes. Do you want me to decode the delta lengths to get exact sizes or do you think the estimates are good enough?

For the output chunking, the granularity we care about right now is just at the page level. The individual row string sizes don't matter. Is it possible to know total page string size without the full decode? Also, how good would the estimate be?

nvdbaranec avatar Jan 30 '24 16:01 nvdbaranec

For the output chunking, the granularity we care about right now is just at the page level. The individual row string sizes don't matter. Is it possible to know total page string size without the full decode?

For DELTA_LENGTH_BYTE_ARRAY, it's much like plain encoding: if we know the size of the encoded length block, we can subtract that from the page size. Finding the end of the length block isn't terribly expensive, so it's probably worthwhile to do that correctly.
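
A minimal host-side sketch of that arithmetic, assuming we already know where the packed length block ends (or have decoded the lengths themselves); these inputs are hypothetical, not the real decoder state:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// DELTA_LENGTH_BYTE_ARRAY: once the end of the packed length block is known,
// the remaining page bytes are the concatenated string data.
std::size_t dlba_size_by_subtraction(std::size_t page_data_bytes, std::size_t lengths_block_bytes)
{
  return page_data_bytes - lengths_block_bytes;
}

// Equivalent result by summing the decoded lengths (requires actually decoding them).
std::size_t dlba_size_by_summing(std::vector<std::uint32_t> const& decoded_lengths)
{
  return std::accumulate(decoded_lengths.begin(), decoded_lengths.end(), std::size_t{0});
}
```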

For DELTA_BYTE_ARRAY it's another story. To get the string size we need to decode the suffix and prefix lengths and sum them; estimating a size without doing that is pretty much impossible, since a column with a lot of repetition might explode in size when decoded. I'd have to do some timings, but I don't know whether decoding the lengths is ruinously expensive (maybe the same order of magnitude as the current time to traverse a plain-encoded page); the decoder does add 2 KB to the shared memory budget, though.
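
For reference, a minimal sketch of that summation, with the vectors standing in for the decoded prefix/suffix length blocks (not the real decoder structures):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// DELTA_BYTE_ARRAY: each value is reconstructed as a prefix shared with the
// previous value plus a suffix stored in the page, so the decoded size is the
// element-wise sum of the two length blocks.
std::size_t dba_string_bytes(std::vector<std::uint32_t> const& prefix_lengths,
                             std::vector<std::uint32_t> const& suffix_lengths)
{
  std::size_t total = 0;
  for (std::size_t i = 0; i < prefix_lengths.size(); ++i) {
    total += prefix_lengths[i] + suffix_lengths[i];
  }
  return total;
}
```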

Edit: Did a quick check of PLAIN vs DICT vs DELTA_LENGTH, and found that for a file with 76M rows of small strings, gpuComputePageSizes takes ~160ms for PLAIN, ~40ms for DICT, and ~5ms for DELTA_LENGTH_BYTE_ARRAY, so no concern there. Now to whip up a file with DELTA_BYTE_ARRAY...

Edit 2: DELTA_BYTE_ARRAY is closer to DICT than PLAIN...~32ms

etseidl avatar Jan 30 '24 16:01 etseidl

Sounds like the accurate approach is the way to go. We could live with an approximation if it were guaranteed to be a conservative overestimate, but if it can come in too small we risk actually producing too much output (which is only really an issue when output chunking is used to limit column lengths, but that's an important case).
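
A simplified sketch of why an underestimate bites us (purely illustrative, not the actual reader code): pages are grouped by their estimated sizes against a byte limit, so estimates that run small let a chunk exceed the limit once it is actually materialized.

```cpp
#include <cstddef>
#include <vector>

// Group page indices into chunks using *estimated* page sizes. If the
// estimates are smaller than the true decoded sizes, a chunk can exceed
// chunk_byte_limit when the pages are actually decoded.
std::vector<std::vector<std::size_t>> chunk_pages(std::vector<std::size_t> const& estimated_page_bytes,
                                                  std::size_t chunk_byte_limit)
{
  std::vector<std::vector<std::size_t>> chunks(1);
  std::size_t current = 0;
  for (std::size_t i = 0; i < estimated_page_bytes.size(); ++i) {
    if (current > 0 && current + estimated_page_bytes[i] > chunk_byte_limit) {
      chunks.emplace_back();  // close the current chunk before exceeding the limit
      current = 0;
    }
    chunks.back().push_back(i);
    current += estimated_page_bytes[i];
  }
  return chunks;
}
```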

nvdbaranec avatar Jan 30 '24 19:01 nvdbaranec

/ok to test

vuule avatar Jan 31 '24 17:01 vuule

/ok to test

vuule avatar Feb 06 '24 23:02 vuule

/ok to test

hyperbolic2346 avatar Feb 20 '24 22:02 hyperbolic2346

/ok to test

hyperbolic2346 avatar Feb 21 '24 19:02 hyperbolic2346

/ok to test

hyperbolic2346 avatar Feb 22 '24 16:02 hyperbolic2346

/ok to test

vuule avatar Feb 27 '24 18:02 vuule

/ok to test

vuule avatar Feb 28 '24 23:02 vuule

Hope springs eternal 🥲

etseidl avatar Feb 28 '24 23:02 etseidl

Hope springs eternal 🥲

A fix has been merged, so there ~is~was hope. I'm just not sure if it's too early.

vuule avatar Feb 29 '24 00:02 vuule

/ok to test

vuule avatar Feb 29 '24 20:02 vuule

/merge

vuule avatar Mar 01 '24 02:03 vuule