[Feature] pyarrow parquet write_table can save up to 30% storage with compression flag ‘ZSTD’

Open yuanchi2807 opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Library/core

Feature

https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/python/src/data_processing/utils/transform_utils.py#L151

    # convert table to bytes
    writer = pa.BufferOutputStream()
    pq.write_table(table=table, where=writer, compression='ZSTD')
    return bytes(writer.getvalue())
  

ZSTD can save up to 30% storage space compared to Snappy.

Submitted as proposed by R. Jain and M. L. Hershcovitch.

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

yuanchi2807 • Jul 10 '24 20:07

PR #404 opened for review.

yuanchi2807 • Jul 11 '24 13:07

I believe this is fixed in #404 and #441

daw3rd • Jul 31 '24 13:07

This is available after 0.2.0 in TransformUtils.convert_arrow_to_binary()

daw3rd • Sep 13 '24 16:09