
feat(python): supports writing compressed csv.gz files

Open Mottl opened this issue 1 year ago • 8 comments

DataFrame.write_csv() will automatically compress data if a filename ends with .gz. Closes #13227

Mottl avatar Jun 22 '24 12:06 Mottl

Codecov Report

Attention: Patch coverage is 50.00000% with 19 lines in your changes missing coverage. Please review.

Project coverage is 80.87%. Comparing base (46ba436) to head (d3dc907). Report is 2466 commits behind head on main.

Files with missing lines Patch % Lines
py-polars/src/dataframe/io.rs 50.00% 19 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17124      +/-   ##
==========================================
+ Coverage   80.84%   80.87%   +0.02%     
==========================================
  Files        1456     1456              
  Lines      191340   191360      +20     
  Branches     2739     2739              
==========================================
+ Hits       154689   154758      +69     
+ Misses      36144    36095      -49     
  Partials      507      507              

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Jun 22 '24 13:06 codecov[bot]

If we add that functionality, it should be on the Rust side; we shouldn't implement things only for Python. Something I am concerned about, and want improved for the reading path as well, is doing this in a batched fashion: I don't want to one-shot everything in memory (as is currently the case with reading).

ritchie46 avatar Jun 23 '24 06:06 ritchie46

Unlike the Python side, the Rust side doesn't take a filename as an argument for writing CSV; it's generic over std::io::Write. https://github.com/pola-rs/polars/blob/6f3c68b4920310e20fd6c78a7a9f2d947b608037/crates/polars-io/src/csv/write/writer.rs#L27-L46
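
That genericity already lets a caller layer compression on by hand. A minimal sketch, assuming flate2 as a dependency (write_csv_gz is a hypothetical helper for illustration, not an existing polars API):

```rust
use std::fs::File;

use flate2::{write::GzEncoder, Compression};
use polars::prelude::*;

// Hypothetical helper: wrap the file in a gzip encoder and hand it to
// CsvWriter, which accepts any std::io::Write implementation.
fn write_csv_gz(df: &mut DataFrame, path: &str) -> PolarsResult<()> {
    let file = File::create(path)?;
    let mut encoder = GzEncoder::new(file, Compression::default());

    CsvWriter::new(&mut encoder).finish(df)?;

    // Finish the gzip stream explicitly so the trailer is written.
    encoder.finish()?;
    Ok(())
}
```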

Mottl avatar Jun 24 '24 02:06 Mottl

Well, I am also not a fan of automatically compressing based on the file name. It should be an opt-in keyword argument (or a setting in Rust). So that solves that.

ritchie46 avatar Jun 24 '24 11:06 ritchie46

It's your choice. To me, the pandas approach is intuitive and works decently.

Mottl avatar Jun 24 '24 11:06 Mottl

It's your choice. To me, the pandas approach is intuitive and works decently.

I'd like to have it explicit, and it should also work for writing to buffers.
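
To sketch that point: when compression is explicit, the caller picks the encoder, so the same code path serves files and in-memory buffers alike (again assuming flate2; to_csv_gz_bytes is hypothetical):

```rust
use flate2::{write::GzEncoder, Compression};
use polars::prelude::*;

// Hypothetical helper: write gzip-compressed CSV into a Vec<u8> buffer.
fn to_csv_gz_bytes(df: &mut DataFrame) -> PolarsResult<Vec<u8>> {
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    CsvWriter::new(&mut encoder).finish(df)?;
    Ok(encoder.finish()?) // the compressed bytes
}
```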

ritchie46 avatar Jun 24 '24 13:06 ritchie46

If we add that functionality, it should be on the Rust side; we shouldn't implement things only for Python. Something I am concerned about, and want improved for the reading path as well, is doing this in a batched fashion: I don't want to one-shot everything in memory (as is currently the case with reading).

BGZF-based compression might be a good idea for this (in short: multiple concatenated gzip files, compressed in small blocks, which allows decompressing individual blocks). This compression format is ubiquitous in bioinformatics (a lot of gzipped files there are BGZF-compressed).

In the default implementation of BGZF (de)compression (the bgzip utility in https://github.com/samtools/htslib/), libdeflate is used for compressing/decompressing chunks in parallel; it is the fastest deflate compressor (faster than zlib-ng). Optionally, tabix can be used to also create an index (assuming the columns you index on are sorted).

https://samtools.github.io/hts-specs/SAMv1.pdf :

4.1 The BGZF compression format

BGZF is block compression implemented on top of the standard gzip file format. The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries. The BGZF format is ‘gunzip compatible’, in the sense that a compliant gunzip utility can decompress a BGZF compressed file. A BGZF file is a series of concatenated BGZF blocks, each no larger than 64Kb before or after compression. Each BGZF block is itself a spec-compliant gzip archive which contains an “extra field” in the format described in RFC1952. The gzip file format allows the inclusion of application-specific extra fields and these are ignored by compliant decompression implementations. The gzip specification also allows gzip files to be concatenated. The result of decompressing concatenated gzip files is the concatenation of the uncompressed data. Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:

  1. The F.EXTRA bit in the header is set to indicate that extra fields are present.
  2. The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII ‘BC’).
  3. The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
  4. The payload of the BGZF extra field is a 16-bit unsigned integer in little endian format. This integer gives the size of the containing BGZF block minus one. On disk, a complete BGZF file is a series of blocks as shown in the following table. (All integers are little endian as is required by RFC1952.)

Multiple BGZF implementations exist in Rust:
  • https://github.com/zaeleus/noodles/blob/master/noodles-bgzf/examples/bgzf_read_multithreaded.rs
  • https://github.com/sstadick/gzp
  • https://github.com/informationsea/bgzip-rs
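
As a rough sketch of how this could compose with the CSV writer today, using the gzp crate listed above (exact module paths and feature flags are assumptions, check gzp's docs): gzp's ParCompress compresses fixed-size blocks on a thread pool and implements std::io::Write, so it can sit between CsvWriter and the file.

```rust
use std::fs::File;

use gzp::{
    deflate::Bgzf,
    par::compress::{ParCompress, ParCompressBuilder},
    ZWriter,
};
use polars::prelude::*;

// Hypothetical helper: write CSV through a parallel BGZF compressor.
fn write_csv_bgzf(df: &mut DataFrame, path: &str) -> PolarsResult<()> {
    let file = File::create(path)?;
    let mut bgzf: ParCompress<Bgzf> = ParCompressBuilder::new().from_writer(file);

    CsvWriter::new(&mut bgzf).finish(df)?;

    // Flush the remaining blocks and write the BGZF EOF marker.
    bgzf.finish()
        .map_err(|e| PolarsError::ComputeError(format!("bgzf finish failed: {e}").into()))?;
    Ok(())
}
```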

ghuls avatar Jul 02 '24 08:07 ghuls

Btw, gzip supports multiple members out of the box, but without random access: flate2::read::MultiGzDecoder
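
A minimal sketch of that: MultiGzDecoder keeps decoding across gzip member boundaries, so a concatenated .gz (such as a BGZF file) decompresses to the full payload, just without random access.

```rust
use std::fs::File;
use std::io::{self, Read};

use flate2::read::MultiGzDecoder;

// Decompress every gzip member in the file into one string.
fn read_all_members(path: &str) -> io::Result<String> {
    let mut decoder = MultiGzDecoder::new(File::open(path)?);
    let mut text = String::new();
    decoder.read_to_string(&mut text)?;
    Ok(text)
}
```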

Mottl avatar Jul 03 '24 05:07 Mottl

Is there an ETA on this feature? I wonder if it supports cloud file systems as well, e.g. s3://<bucket>/my_file.gz

bn-c avatar May 01 '25 08:05 bn-c

Not really. We didn't agree on the implementation. I prefer the pandas approach of automatically recognizing compression from the ".csv.gz" extension, while @ritchie46 wants a dedicated argument that turns compression on.

Mottl avatar May 01 '25 08:05 Mottl

@Mottl thanks for the quick reply.

Unfortunately, I happen to know of enterprise use cases that store gzipped CSVs under other extensions (.data, .what-ever-we-like-to-call-it). I can appreciate the aesthetics of inferring from the extension; however, I think @ritchie46 has a point that there should at least be an optional way to opt in to or out of the feature.

Configurable, but with a sane default?

From your commit details, I assume the only way to enable the gzip feature is to have your file name end with .gz?

bn-c avatar May 01 '25 09:05 bn-c

I can only imagine there is some other poor soul out there maintaining .csv.gz files that are actually plain-text CSV (or worse, a different compression algorithm) 🤣

bn-c avatar May 01 '25 09:05 bn-c

You can use pandas, which supports csv.gz out of the box, or you can convert the CSV to Parquet after you finish writing it. There are a lot of Parquet converters. The one I use is https://github.com/Mottl/csv2pq; another option is https://github.com/domoritz/arrow-tools.

Mottl avatar May 01 '25 09:05 Mottl

I think this is stale now and would basically require a completely different implementation.

coastalwhite avatar May 23 '25 09:05 coastalwhite