[Enhancement] Support multiple compression formats (GZIP/SNAPPY/ZSTD/LZ4) for CSV file exports
Add GZIP/SNAPPY/ZSTD/LZ4 compression support to CSV exports via INSERT INTO FILES. For example, specifying 'compression'='gzip' compresses the CSV output and automatically produces .csv.gz files.
Key changes:
- Add CompressedAsyncOutputStreamFile class to handle compression
- Update CSVFileWriter to support compressed output streams
- Automatically append a compression-specific extension (.gz, .snappy, .zst, .lz4) to compressed CSV files
- Add comprehensive unit and integration tests (97.8% coverage)
- Use BlockCompressionCodec for efficient compression
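The file-naming rule in the list above can be pictured with a small sketch. This is Python for illustration only, not the BE code (which is C++). The helper name `csv_output_suffix` and the lookup table are hypothetical, and the suffixes are the ones named in this PR. Case-insensitive codec names and rejecting unknown codecs with an error are assumptions, not confirmed behavior.

```python
# Hypothetical lookup table: mirrors the suffixes named in this PR,
# not the actual BE implementation.
_CSV_COMPRESSION_SUFFIX = {
    "gzip": ".gz",
    "snappy": ".snappy",
    "zstd": ".zst",
    "lz4": ".lz4",
}


def csv_output_suffix(compression=None):
    """Return the file suffix for a CSV export with the given codec name."""
    if not compression:
        return ".csv"  # uncompressed export keeps the plain suffix
    key = compression.lower()  # assumption: codec names are case-insensitive
    if key not in _CSV_COMPRESSION_SUFFIX:
        # Assumption: unknown codecs are rejected rather than silently ignored.
        raise ValueError(f"unsupported CSV compression: {compression}")
    return ".csv" + _CSV_COMPRESSION_SUFFIX[key]
```

Under these assumptions, `csv_output_suffix("gzip")` yields `.csv.gz`, matching the behavior described in the summary.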
Usage example
INSERT INTO FILES(
'path' = 'file:///tmp/export/',
'format' = 'csv',
'compression' = 'gzip'
)
SELECT * FROM my_table;
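One way to sanity-check an export like the one above is to read a produced file back with a standard gzip reader. The following is a minimal Python sketch, assuming gzip output encoded as UTF-8. The helper name `read_gzip_csv` is hypothetical, and the delimiter is a parameter because it has to match whatever separator the export actually used.

```python
import csv
import gzip


def read_gzip_csv(path, delimiter=","):
    """Decompress a .csv.gz file and return its rows as lists of strings."""
    # Text mode ("rt") lets the csv module consume the decompressed stream directly.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))
```

Point it at one of the generated .csv.gz files under the export path. The actual file names are chosen by the system, so none is shown here.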
Why I'm doing:
What I'm doing:
Fixes #issue
What type of PR is this:
- [ ] BugFix
- [ ] Feature
- [x] Enhancement
- [ ] Refactor
- [ ] UT
- [ ] Doc
- [ ] Tool
Does this PR entail a change in behavior?
- [x] Yes, this PR will result in a change in behavior.
- [ ] No, this PR will not result in a change in behavior.
If yes, please specify the type of change:
- [x] Interface/UI changes: syntax, type conversion, expression evaluation, display information
- [ ] Parameter changes: default values, similar parameters but with different default values
- [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
- [ ] Feature removed
- [ ] Miscellaneous: upgrade & downgrade compatibility, etc.
Checklist:
- [x] I have added test cases for my bug fix or my new feature
- [ ] This pr needs user documentation (for new or modified features or behaviors)
- [ ] I have added documentation for my new feature or new function
- [ ] This is a backport pr
Bugfix cherry-pick branch check:
- [ ] I have checked the version labels which the pr will be auto-backported to the target branch
- [ ] 4.0
- [ ] 3.5
- [ ] 3.4
- [ ] 3.3
[!NOTE] Enable CSV export compression (gzip/snappy/zstd/lz4), auto-append proper file suffixes, and add comprehensive tests.
- CSV export compression:
  - Add csv::CompressedAsyncOutputStreamFile to compress output using GZIP/SNAPPY/ZSTD/LZ4/LZ4_FRAME.
  - CSVFileWriterFactory selects compressed vs uncompressed stream based on compression_type; CSVFileWriter simplified (drops compression arg) and writes via generic csv::OutputStream.
- File naming:
  - FileChunkSinkProvider appends compression-specific extensions for CSV (.gz, .snappy, .zst, .lz4).
- Build/System:
  - Register new sources in be/src/formats/CMakeLists.txt.
- Tests:
  - Add extensive unit tests for compressed/uncompressed streams and CSV writer (including custom delimiters, large data, all codecs, and invalid codec death test); update test CMake.
Written by Cursor Bugbot for commit 51e09305d37089511c75ea530dbd74e944bfc52d.
@cursor review
why not https://github.com/StarRocks/starrocks/pull/63327
🧪 CI Insights
Here's what we observed from your CI run for 51e09305.
🟢 All jobs passed!
But CI Insights is watching 👀
[Java-Extensions Incremental Coverage Report]
:white_check_mark: pass : 0 / 0 (0%)
[FE Incremental Coverage Report]
:white_check_mark: pass : 0 / 0 (0%)
[BE Incremental Coverage Report]
:white_check_mark: pass : 43 / 53 (81.13%)
file detail
| | path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|---|
| :large_blue_circle: | be/src/formats/csv/output_stream_file.h | 0 | 1 | 00.00% | [73] |
| :large_blue_circle: | be/src/connector/file_chunk_sink.cpp | 4 | 13 | 30.77% | [57, 58, 60, 61, 63, 64, 66, 67, 68] |
| :large_blue_circle: | be/src/formats/csv/output_stream_file.cpp | 30 | 30 | 100.00% | [] |
| :large_blue_circle: | be/src/formats/csv/csv_file_writer.cpp | 9 | 9 | 100.00% | [] |
why not #63327
@eshishki We didn't see you actively responding to comments in that PR, so we had to start a new thread to support this feature.