
Fix chunked reads of Parquet delta encoded pages

etseidl opened this pull request 1 year ago • 7 comments

Description

The chunked Parquet reader currently does not properly estimate the sizes of delta-encoded string pages. This PR modifies gpuDecodeTotalPageStringSize() to account for the new encodings.
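
As a rough sketch of the intent (not the actual kernel code; the enum values and struct fields below are simplified stand-ins for illustration only), the per-page accounting now branches on the encoding:

```cpp
#include <cstddef>

// Illustrative only: these types are stand-ins, not the real cudf structures.
enum class Encoding { PLAIN, PLAIN_DICTIONARY, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY };

struct PageInfo {
  Encoding encoding;
  std::size_t uncompressed_page_size;    // total bytes in the decompressed page data
  std::size_t encoded_lengths_size;      // bytes occupied by the delta-packed length block
  std::size_t prefix_suffix_length_sum;  // sum of decoded prefix + suffix lengths
};

// Estimate the total number of string bytes the page produces when decoded.
std::size_t total_page_string_size(PageInfo const& p)
{
  switch (p.encoding) {
    case Encoding::DELTA_LENGTH_BYTE_ARRAY:
      // Lengths are packed up front; everything after them is raw string data.
      return p.uncompressed_page_size - p.encoded_lengths_size;
    case Encoding::DELTA_BYTE_ARRAY:
      // Prefix-compressed strings: the decoded size is the sum of prefix and suffix lengths.
      return p.prefix_suffix_length_sum;
    default:
      // PLAIN and dictionary pages are handled by the pre-existing logic; placeholder here.
      return p.uncompressed_page_size;
  }
}
```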

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

etseidl avatar Jan 29 '24 18:01 etseidl

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Jan 29 '24 18:01 copy-pr-bot[bot]

@nvdbaranec I'd like your opinion on how to handle the sizes. Do you want me to decode the delta lengths to get exact sizes or do you think the estimates are good enough?

etseidl avatar Jan 29 '24 18:01 etseidl

@nvdbaranec I'd like your opinion on how to handle the sizes. Do you want me to decode the delta lengths to get exact sizes or do you think the estimates are good enough?

For the output chunking, the granularity we care about right now is just at the page level. The individual row string sizes don't matter. Is it possible to know total page string size without the full decode? Also, how good would the estimate be?

nvdbaranec avatar Jan 30 '24 16:01 nvdbaranec

For the output chunking, the granularity we care about right now is just at the page level. The individual row string sizes don't matter. Is it possible to know total page string size without the full decode?

For DELTA_LENGTH_BYTE_ARRAY, it's much like plain encoding: if we know the size of the encoded length block, we can subtract that from the page size. Finding the end of the length block isn't terribly expensive, so it's probably worthwhile to do that correctly.
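
A minimal host-side sketch of that arithmetic, assuming we already know where the packed length block ends (or have decoded the lengths themselves); these inputs are hypothetical, not the real decoder state:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// DELTA_LENGTH_BYTE_ARRAY: once the end of the packed length block is known,
// the remaining page bytes are the concatenated string data.
std::size_t dlba_size_by_subtraction(std::size_t page_data_bytes, std::size_t lengths_block_bytes)
{
  return page_data_bytes - lengths_block_bytes;
}

// Equivalent result by summing the decoded lengths (requires actually decoding them).
std::size_t dlba_size_by_summing(std::vector<std::uint32_t> const& decoded_lengths)
{
  return std::accumulate(decoded_lengths.begin(), decoded_lengths.end(), std::size_t{0});
}
```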

For DELTA_BYTE_ARRAY it's another story. To get the string size we need to decode the suffix and prefix lengths and sum them; estimating a size without doing that is pretty much impossible, since a column with a lot of repetition might explode in size when decoded. I'd have to do some timings, but I don't know whether decoding the lengths is ruinously expensive (maybe the same order of magnitude as the current time to traverse a plain-encoded page); the decoder does add 2 KB to the shared memory budget, though.
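
For reference, a minimal sketch of that summation, with the vectors standing in for the decoded prefix/suffix length blocks (not the real decoder structures):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// DELTA_BYTE_ARRAY: each value is reconstructed as a prefix shared with the
// previous value plus a suffix stored in the page, so the decoded size is the
// element-wise sum of the two length blocks.
std::size_t dba_string_bytes(std::vector<std::uint32_t> const& prefix_lengths,
                             std::vector<std::uint32_t> const& suffix_lengths)
{
  std::size_t total = 0;
  for (std::size_t i = 0; i < prefix_lengths.size(); ++i) {
    total += prefix_lengths[i] + suffix_lengths[i];
  }
  return total;
}
```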

Edit: Did a quick check of PLAIN vs DICT vs DELTA_LENGTH, and found that for a file with 76M rows of small strings, gpuComputePageSizes takes ~160ms for PLAIN, ~40ms for DICT, and ~5ms for DELTA_LENGTH_BYTE_ARRAY, so no concern there. Now to whip up a file with DELTA_BYTE_ARRAY...

Edit 2: DELTA_BYTE_ARRAY is closer to DICT than PLAIN...~32ms

etseidl avatar Jan 30 '24 16:01 etseidl

Sounds like the accurate approach is the way to go. We could live with an approximation if it were guaranteed to be a conservative overestimate, but if it can come in too small we risk actually producing too much output (which is only really an issue when output chunking is used to limit column lengths, but that's an important case).
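
A simplified sketch of why an underestimate bites us (purely illustrative, not the actual reader code): pages are grouped by their estimated sizes against a byte limit, so estimates that run small let a chunk exceed the limit once it is actually materialized.

```cpp
#include <cstddef>
#include <vector>

// Group page indices into chunks using *estimated* page sizes. If the
// estimates are smaller than the true decoded sizes, a chunk can exceed
// chunk_byte_limit when the pages are actually decoded.
std::vector<std::vector<std::size_t>> chunk_pages(std::vector<std::size_t> const& estimated_page_bytes,
                                                  std::size_t chunk_byte_limit)
{
  std::vector<std::vector<std::size_t>> chunks(1);
  std::size_t current = 0;
  for (std::size_t i = 0; i < estimated_page_bytes.size(); ++i) {
    if (current > 0 && current + estimated_page_bytes[i] > chunk_byte_limit) {
      chunks.emplace_back();  // close the current chunk before exceeding the limit
      current = 0;
    }
    chunks.back().push_back(i);
    current += estimated_page_bytes[i];
  }
  return chunks;
}
```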

nvdbaranec avatar Jan 30 '24 19:01 nvdbaranec

/ok to test

vuule avatar Jan 31 '24 17:01 vuule

/ok to test

vuule avatar Feb 06 '24 23:02 vuule

/ok to test

hyperbolic2346 avatar Feb 20 '24 22:02 hyperbolic2346

/ok to test

hyperbolic2346 avatar Feb 21 '24 19:02 hyperbolic2346

/ok to test

hyperbolic2346 avatar Feb 22 '24 16:02 hyperbolic2346

/ok to test

vuule avatar Feb 27 '24 18:02 vuule

/ok to test

vuule avatar Feb 28 '24 23:02 vuule

Hope springs eternal 🥲

etseidl avatar Feb 28 '24 23:02 etseidl

Hope springs eternal 🥲

A fix has been merged, so there ~is~was hope. I'm just not sure if it's too early.

vuule avatar Feb 29 '24 00:02 vuule

/ok to test

vuule avatar Feb 29 '24 20:02 vuule

/merge

vuule avatar Mar 01 '24 02:03 vuule