
[FEA] Cache schema tree in `chunked_parquet_writer`

Matt711 opened this issue 9 months ago · 3 comments

Is your feature request related to a problem? Please describe.
When using cudf::io::chunked_parquet_writer, the schema tree is recomputed for every chunk passed to writer.write(). This introduces unnecessary overhead when writing many chunks with the same schema.

Describe the solution you'd like
Cache the schema tree after the first write() call and reuse it for subsequent chunks, assuming the schema is identical.

Describe alternatives you've considered
Continue with the status quo, i.e. recomputing the schema tree for every chunk. This remains inefficient for workloads with a high number of chunks.

Additional context
libcudf chunked write benchmark for reference: https://github.com/rapidsai/cudf/pull/19015#issuecomment-2922368382

Matt711 avatar Jun 02 '25 14:06 Matt711

CC @mhaseeb123 @vuule

Matt711 avatar Jun 02 '25 14:06 Matt711

We would still want to check that the schemas match, right? Is that much faster than building the schema tree?

vuule avatar Jun 02 '25 16:06 vuule

> We would still want to check that the schemas match, right? Is that much faster than building the schema tree?

We do that now as well (when we reuse the aggregate_writer_metadata), so with caching we won't have to do that anymore.

mhaseeb123 avatar Jun 02 '25 19:06 mhaseeb123