PARQUET-2414: Extend BYTE_STREAM_SPLIT to support INT32, INT64 and FIXED_LEN_BYTE_ARRAY data
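
For context, a minimal sketch of what BYTE_STREAM_SPLIT does when applied to INT32 values, assuming little-endian fixed-width values as in Parquet's plain encoding. The function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical and used only for illustration; this is not the Arrow or Parquet Java implementation. Byte k of every value is scattered into stream k, and the streams are then concatenated.

```python
import struct


def byte_stream_split_encode(values, width=4, fmt="<i"):
    """Scatter byte k of every value into stream k, then concatenate the streams.

    width is the value size in bytes; fmt is the struct format for one value
    (assumption: "<i" for INT32, "<q" with width=8 for INT64).
    """
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # raw[k::width] selects byte k of each packed value.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, width=4, fmt="<i"):
    """Inverse of the encoder: gather byte i of each stream back into value i."""
    n = len(data) // width  # number of values
    out = []
    for i in range(n):
        chunk = bytes(data[k * n + i] for k in range(width))
        out.append(struct.unpack(fmt, chunk)[0])
    return out


values = [1, 2, 3]
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded)
```

The encoding does not shrink the data by itself. For small integers the high-order byte streams are mostly zeros, which a general-purpose compressor applied afterwards can typically exploit.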

Open pitrou opened this issue 1 year ago • 3 comments

pitrou, Jan 18 '24 09:01

+1 I think this is great. Are PoCs needed for this? I'm interested in seeing how well this works as a DELTA_BINARY_PACKED replacement for my data.

etseidl, Feb 29 '24 18:02

@etseidl I've written the implementation for Parquet C++ here: https://github.com/apache/arrow/pull/40094

I was planning to implement it for Parquet Java, but you may want to do it as well.

pitrou, Feb 29 '24 19:02

> I was planning to implement it for Parquet Java, but you may want to do it as well.

Sounds good. I'll put it in my queue. I'll check out your arrow implementation to see if there are any pitfalls to avoid. Thanks!

etseidl, Feb 29 '24 19:02

Thank you @pitrou for investigating this! Extending BYTE_STREAM_SPLIT to more data types will give us great new options in RAPIDS.

GregoryKimball, Mar 15 '24 18:03