feat(python): supports writing compressed csv.gz files
DataFrame.write_csv() will automatically compress the data if the filename ends with .gz.
Closes #13227
Codecov Report
Attention: Patch coverage is 50.00000% with 19 lines in your changes missing coverage. Please review.
Project coverage is 80.87%. Comparing base (46ba436) to head (d3dc907). Report is 2466 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| py-polars/src/dataframe/io.rs | 50.00% | 19 Missing :warning: |
Additional details and impacted files
| Coverage Diff | main | #17124 | +/- |
|---|---|---|---|
| Coverage | 80.84% | 80.87% | +0.02% |
| Files | 1456 | 1456 | |
| Lines | 191340 | 191360 | +20 |
| Branches | 2739 | 2739 | |
| Hits | 154689 | 154758 | +69 |
| Misses | 36144 | 36095 | -49 |
| Partials | 507 | 507 | |
:umbrella: View full report in Codecov by Sentry.
Unlike the Python side, the Rust side doesn't take a filename as an argument for writing CSV; it's generic over std::io::Write.
https://github.com/pola-rs/polars/blob/6f3c68b4920310e20fd6c78a7a9f2d947b608037/crates/polars-io/src/csv/write/writer.rs#L27-L46
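Because CsvWriter is generic over std::io::Write, compression can already be layered on by the caller without any polars changes. A minimal sketch, assuming the `polars` crate (with the "csv" feature) plus `flate2`; the column names and output path are placeholders:

```rust
// Sketch only, not part of polars: wrap the file in a gzip encoder and hand
// that encoder to CsvWriter as its generic writer.
use flate2::write::GzEncoder;
use flate2::Compression;
use polars::prelude::*;
use std::fs::File;

fn main() -> PolarsResult<()> {
    let mut df = df!(
        "a" => [1i64, 2, 3],
        "b" => ["x", "y", "z"]
    )?;

    let file = File::create("out.csv.gz")?;
    // GzEncoder implements std::io::Write, so it slots straight into
    // CsvWriter's generic writer parameter.
    let mut encoder = GzEncoder::new(file, Compression::default());
    CsvWriter::new(&mut encoder).finish(&mut df)?;
    // Finish the gzip stream explicitly so the trailer (CRC32 + size) is written.
    encoder.finish()?;
    Ok(())
}
```

The same pattern works for any Write sink (a Vec<u8>, a network stream, ...), which is exactly what being generic over std::io::Write buys.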
Well, I am also not a fan of automatically compressing based on the file name. It should be an opt-in keyword argument (or a setting in Rust). So that solves that.
It's your choice. To me, the pandas approach is intuitive and works decently.
I like to have it explicit and it should also work for writing to buffers.
If we add that functionality, it should be on the Rust side. We shouldn't implement things only for Python. Something I am concerned about, and want to see improved for the reading part as well, is doing this in a batched fashion; I don't want to one-shot everything in memory (as is currently the case with reading).
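To make the explicit opt-in idea concrete on the Rust side, here is a purely hypothetical sketch; `CsvCompression` and `write_csv_with` do not exist in polars and are illustration only, with the gzip path simply wrapping the caller's writer in flate2's GzEncoder:

```rust
// Hypothetical illustration only: neither `CsvCompression` nor `write_csv_with`
// exists in polars. Assumes the `polars` crate ("csv" feature) and `flate2`.
use flate2::write::GzEncoder;
use flate2::Compression;
use polars::prelude::*;
use std::io::Write;

/// Explicit, opt-in compression setting (hypothetical).
pub enum CsvCompression {
    Uncompressed,
    Gzip,
}

/// Write `df` as CSV to any `Write` sink, optionally gzip-compressed.
pub fn write_csv_with<W: Write>(
    df: &mut DataFrame,
    writer: W,
    compression: CsvCompression,
) -> PolarsResult<()> {
    match compression {
        CsvCompression::Uncompressed => CsvWriter::new(writer).finish(df),
        CsvCompression::Gzip => {
            let mut encoder = GzEncoder::new(writer, Compression::default());
            CsvWriter::new(&mut encoder).finish(df)?;
            encoder.finish()?; // flush the gzip trailer
            Ok(())
        }
    }
}
```

Because the function stays generic over `W: Write`, the same opt-in path works for files and for in-memory buffers alike.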
BGZF-based compression might be a good idea for this (in short: multiple concatenated gzip members, compressed in small blocks, which allows individual blocks to be decompressed). This compression format is ubiquitous in bioinformatics, where a lot of gzipped files are actually BGZF-compressed.
In the default implementation of BGZF (de)compression (the bgzip utility in https://github.com/samtools/htslib/), libdeflate is used to compress/decompress chunks in parallel; it is the fastest deflate compressor (faster than zlib-ng). Optionally, tabix can be used to also create an index (assuming the columns you index on are sorted).
https://samtools.github.io/hts-specs/SAMv1.pdf :
4.1 The BGZF compression format
BGZF is block compression implemented on top of the standard gzip file format. The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries. The BGZF format is ‘gunzip compatible’, in the sense that a compliant gunzip utility can decompress a BGZF compressed file. A BGZF file is a series of concatenated BGZF blocks, each no larger than 64Kb before or after compression. Each BGZF block is itself a spec-compliant gzip archive which contains an “extra field” in the format described in RFC1952. The gzip file format allows the inclusion of application-specific extra fields and these are ignored by compliant decompression implementations. The gzip specification also allows gzip files to be concatenated. The result of decompressing concatenated gzip files is the concatenation of the uncompressed data. Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:
- The F.EXTRA bit in the header is set to indicate that extra fields are present.
- The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII ‘BC’).
- The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
- The payload of the BGZF extra field is a 16-bit unsigned integer in little endian format. This integer gives the size of the containing BGZF block minus one.

On disk, a complete BGZF file is a series of such blocks. (All integers are little endian as is required by RFC1952.)
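As a concrete reading of the block layout quoted above, here is a small standard-library-only sketch (the helper name is made up) that scans the gzip extra field of the first block for the ‘BC’ subfield and recovers the total block size:

```rust
// Sketch: locate the BGZF 'BC' subfield in the first block's gzip header and
// return the block's total compressed size (BSIZE + 1). Standard library only.
use std::fs::File;
use std::io::{self, Read};

fn first_bgzf_block_size(path: &str) -> io::Result<Option<u64>> {
    let mut f = File::open(path)?;

    // Fixed gzip header (ID1 ID2 CM FLG MTIME[4] XFL OS) plus XLEN when FEXTRA is set.
    let mut header = [0u8; 12];
    f.read_exact(&mut header)?;
    let is_gzip_deflate = header[0] == 0x1f && header[1] == 0x8b && header[2] == 8;
    let has_fextra = header[3] & 0x04 != 0;
    if !is_gzip_deflate || !has_fextra {
        return Ok(None); // not a BGZF block
    }

    let xlen = u16::from_le_bytes([header[10], header[11]]) as usize;
    let mut extra = vec![0u8; xlen];
    f.read_exact(&mut extra)?;

    // Each subfield is SI1, SI2, LEN (u16 LE), then LEN payload bytes.
    // BGZF uses SI1 = 66 ('B'), SI2 = 67 ('C'), LEN = 2, payload = BSIZE (u16 LE).
    let mut i = 0;
    while i + 4 <= extra.len() {
        let (si1, si2) = (extra[i], extra[i + 1]);
        let len = u16::from_le_bytes([extra[i + 2], extra[i + 3]]) as usize;
        if si1 == 66 && si2 == 67 && len == 2 && i + 6 <= extra.len() {
            let bsize = u16::from_le_bytes([extra[i + 4], extra[i + 5]]) as u64;
            return Ok(Some(bsize + 1)); // BSIZE stores the block size minus one
        }
        i += 4 + len;
    }
    Ok(None)
}
```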
Multiple BGZF implementations exist in Rust:
- https://github.com/zaeleus/noodles/blob/master/noodles-bgzf/examples/bgzf_read_multithreaded.rs
- https://github.com/sstadick/gzp
- https://github.com/informationsea/bgzip-rs
Btw, gzip supports multiple members out of the box, but without random access: flate2::read::MultiGzDecoder
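A minimal sketch of that, assuming only the `flate2` crate; GzDecoder would stop after the first gzip member, while MultiGzDecoder keeps decoding concatenated members until EOF (still sequentially, with no random access into individual blocks):

```rust
use flate2::read::MultiGzDecoder;
use std::fs::File;
use std::io::{self, Read};

/// Decompress a file made of concatenated gzip members (e.g. a BGZF file)
/// into a single byte buffer.
fn read_all_members(path: &str) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let mut decoder = MultiGzDecoder::new(file);
    let mut bytes = Vec::new();
    decoder.read_to_end(&mut bytes)?;
    Ok(bytes)
}
```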
Is there an ETA on this feature? I wonder if it supports cloud file systems as well, e.g. s3://<bucket>/my_file.gz
Not really. We didn't agree on the implementation. I prefer the pandas approach, with automatic recognition of compression from the ".csv.gz" extension, while @ritchie46 wants a dedicated argument that turns compression on.
@Mottl thanks for the quick reply.
Unfortunately, I happen to know of enterprise use cases that store gzipped CSVs under other extensions (.data, .what-ever-we-like-to-call-it). I can see the appeal of inferring from the extension; however, I think @ritchie46 has a point that there should at least be an optional way to opt in to or out of the feature.
Configurable, but with a sane default?
From your commit details, I assume the only way to enable the gz feature is to have your filename end with .gz?
I can only imagine there is another poor soul out there maintaining .csv.gz files that are actually plain-text CSV (or worse, a different compression algorithm) 🤣
You can use pandas, which supports csv.gz out of the box, or you can convert the CSV to Parquet after you finish writing it. There are a lot of Parquet converters; the one I use is https://github.com/Mottl/csv2pq, another option is https://github.com/domoritz/arrow-tools
I think this is stale now and would require basically a completely different implementation.