[BUG] Reduce peak memory usage for STRUCT decoding in parquet reader
**Describe the bug**
In the libcudf benchmark `PARQUET_READER_NVBENCH`, the STRUCT data type shows a surprisingly high `peak_memory_usage`. For a 536 MB table, the INTEGRAL data type shows a peak memory usage of 597 MiB, while the STRUCT data type shows 996 MiB for the same table size. If there are good reasons for this difference, we can close the issue; otherwise, we should reduce the extra memory overhead.
| data_type | io_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| INTEGRAL | DEVICE_BUFFER | 1000 | 32 | 33x | 15.405 ms | 0.30% | 15.395 ms | 0.29% | 34872906834 | 597.127 MiB | 14.403 MiB |
| FLOAT | DEVICE_BUFFER | 1000 | 32 | 51x | 9.827 ms | 0.26% | 9.818 ms | 0.24% | 54685058116 | 563.539 MiB | 9.888 MiB |
| DECIMAL | DEVICE_BUFFER | 1000 | 32 | 66x | 7.701 ms | 0.49% | 7.691 ms | 0.47% | 69802302000 | 548.740 MiB | 7.213 MiB |
| TIMESTAMP | DEVICE_BUFFER | 1000 | 32 | 1152x | 8.416 ms | 3.03% | 8.406 ms | 3.03% | 63866354457 | 556.717 MiB | 8.719 MiB |
| DURATION | DEVICE_BUFFER | 1000 | 32 | 1392x | 7.919 ms | 2.12% | 7.909 ms | 2.11% | 67879410607 | 612.525 MiB | 8.113 MiB |
| STRING | DEVICE_BUFFER | 1000 | 32 | 928x | 13.539 ms | 1.62% | 13.530 ms | 1.62% | 39678673862 | 669.530 MiB | 8.504 MiB |
| LIST | DEVICE_BUFFER | 1000 | 32 | 7x | 72.190 ms | 0.29% | 72.180 ms | 0.29% | 7437971830 | 558.376 MiB | 24.246 MiB |
| STRUCT | DEVICE_BUFFER | 1000 | 32 | 13x | 41.528 ms | 0.14% | 41.518 ms | 0.14% | 12930954541 | 996.277 MiB | 15.399 MiB |
**Steps/Code to reproduce bug**
Here is an nvbench CLI command you can run to reproduce the above table:
```
./PARQUET_READER_NVBENCH --device 0 --benchmark 0 --axis cardinality=1000 --axis run_length=32
```
**Expected behavior**
INTEGRAL and `STRUCT<INTEGRAL>` decode in the parquet reader should have a similar peak memory footprint.
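As a cross-check outside of nvbench, peak device memory for a single read can be observed with rmm's statistics adaptor. The sketch below is not the benchmark code; it assumes rmm's `statistics_resource_adaptor` exposes `get_bytes_counter().peak`, and uses placeholder files `int.parquet` / `struct_int.parquet` holding the same values once as flat INT64 columns and once as `STRUCT<INT64>` columns.

```cpp
#include <cudf/io/parquet.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/statistics_resource_adaptor.hpp>

#include <cstddef>
#include <iostream>
#include <string>

// Read one parquet file and report the peak number of device bytes allocated
// through the current memory resource while decoding it.
std::size_t peak_read_bytes(std::string const& path)
{
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::statistics_resource_adaptor<rmm::mr::cuda_memory_resource> stats_mr{&cuda_mr};
  auto* old_mr = rmm::mr::set_current_device_resource(&stats_mr);

  std::size_t peak = 0;
  {
    auto const options =
      cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();
    auto const result = cudf::io::read_parquet(options);  // decode the whole file
    peak = stats_mr.get_bytes_counter().peak;             // peak bytes during the read
  }  // output table freed here, still through stats_mr

  rmm::mr::set_current_device_resource(old_mr);
  return peak;
}

int main()
{
  // Placeholder inputs: same values stored as flat INT64 columns and wrapped in
  // STRUCT<INT64>; peak memory is expected to be similar for both.
  std::cout << "INTEGRAL peak bytes: " << peak_read_bytes("int.parquet") << "\n";
  std::cout << "STRUCT   peak bytes: " << peak_read_bytes("struct_int.parquet") << "\n";
  return 0;
}
```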
**Environment overview (please complete the following information)**
- docker image: `rapidsai/ci-conda:cuda12.1.1-ubuntu22.04-py3.11`, pulled on 2024-02-03
- cudf: `branch-24.02` and sha `6cebf2294ff`
**Additional context**
The chunked parquet reader does seem to reduce the memory footprint of STRUCT decode, but the footprint still trends higher than for the other data types.
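For reference, reading through the chunked parquet reader with a bounded output size looks roughly like the sketch below; the 128 MiB limit and the file path are placeholders chosen only for illustration.

```cpp
#include <cudf/io/parquet.hpp>

#include <cstddef>
#include <string>

void read_in_chunks(std::string const& path)
{
  auto const options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();

  // Cap the size of each returned table chunk; value chosen only for illustration.
  std::size_t constexpr chunk_read_limit = 128 * 1024 * 1024;
  cudf::io::chunked_parquet_reader reader(chunk_read_limit, options);

  while (reader.has_next()) {
    auto chunk = reader.read_chunk();  // table_with_metadata for this chunk
    // ... consume chunk.tbl ...
  }
}
```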
There seems to be a bug in `decode_page_data` that causes a double allocation of the nested string column: two `out_buf` objects allocate string data for the same `src_col_index`. This does not happen when there are two columns in the struct.
After further isolation: the bug happens only when the string column is the first child of the second column. This case seems to break the `owning_schema` logic.
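To make the suspected pattern concrete, here is a hypothetical sketch; it is not the actual cudf source, and the simplified `out_buf` struct and `plan_string_allocations` helper are illustrative only.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the reader's output-buffer bookkeeping (illustrative only).
struct out_buf {
  int src_col_index;            // source column this buffer is decoded from
  bool owns_string_data;        // should be true for exactly one buffer per source column
  std::size_t string_size = 0;  // bytes of string data this buffer will allocate
};

// Sum the string bytes that will actually be allocated. Intended behavior: each
// src_col_index contributes once. Suspected bug: when the string column is the
// first child of the second struct column, two buffers end up flagged as owners
// of the same src_col_index, so that column's string data is sized (and later
// allocated) twice.
std::size_t plan_string_allocations(std::vector<out_buf>& bufs,
                                    std::unordered_map<int, std::size_t> const& bytes_per_col)
{
  std::size_t total = 0;
  for (auto& b : bufs) {
    if (!b.owns_string_data) { continue; }
    auto const it = bytes_per_col.find(b.src_col_index);
    if (it == bytes_per_col.end()) { continue; }
    b.string_size = it->second;
    total += b.string_size;  // double-counted if two buffers claim the same column
  }
  return total;
}
```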
CC @nvdbaranec
Opened https://github.com/rapidsai/cudf/pull/15061, which fixes the peak memory use in benchmarks (structs are now in line with the memory use of their nested types).
Closed by #15061