[Enhancement] Support multiple compression formats (GZIP/SNAPPY/ZSTD/LZ4) for CSV file exports

Open tracymacding opened this issue 3 weeks ago • 8 comments

Add GZIP/SNAPPY/ZSTD/LZ4 compression support to CSV exports via INSERT INTO FILES. Users can now set 'compression'='gzip' (or another supported codec) to compress CSV output files; with gzip, the exported files automatically get a .csv.gz suffix.

Key changes:

  1. Add CompressedAsyncOutputStreamFile class to handle compression
  2. Update CSVFileWriter to support compressed output streams
  3. Automatically append a compression-specific extension (for example, .gz for GZIP) to compressed CSV files
  4. Add comprehensive unit and integration tests (97.8% coverage)
  5. Use BlockCompressionCodec for efficient compression
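
To make the shape of changes 1, 2, and 5 concrete, here is a minimal sketch of a stream wrapper that buffers CSV bytes and passes each block through a codec before forwarding it. It is not the PR's implementation: `BufferedCompressingStream`, `BlockCodec`, and `Sink` are hypothetical names, the codec is a plain callback rather than the real BlockCompressionCodec interface (which this description does not show), and treating every block as independently encoded is an assumption that may not match how the actual class frames its output.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a block codec: raw bytes in, encoded bytes out.
// A real codec (gzip, zstd, ...) would be plugged in here; none is linked.
using BlockCodec = std::function<std::string(const std::string&)>;

// Hypothetical sink that receives encoded blocks, e.g. appends them to a file.
using Sink = std::function<void(const std::string&)>;

// Buffers written bytes and hands each full block to the codec before
// forwarding the encoded result to the sink. Illustrative only.
class BufferedCompressingStream {
public:
    BufferedCompressingStream(BlockCodec codec, Sink sink, std::size_t block_size)
            : _codec(std::move(codec)),
              _sink(std::move(sink)),
              _block_size(block_size == 0 ? 1 : block_size) {}  // avoid a zero-size loop

    // Append data and emit a block each time the buffer reaches block_size.
    void write(const std::string& data) {
        _buffer += data;
        while (_buffer.size() >= _block_size) {
            emit(_buffer.substr(0, _block_size));
            _buffer.erase(0, _block_size);
        }
    }

    // Emit the remaining partial block, if any. Call once at end of file.
    void finalize() {
        if (!_buffer.empty()) {
            emit(_buffer);
            _buffer.clear();
        }
    }

    std::size_t blocks_emitted() const { return _blocks; }

private:
    void emit(const std::string& raw) {
        _sink(_codec(raw));
        ++_blocks;
    }

    BlockCodec _codec;
    Sink _sink;
    std::size_t _block_size;
    std::string _buffer;
    std::size_t _blocks = 0;
};
```

Keeping the codec behind a callback means the CSV writer only ever sees a generic output stream, which is consistent with the PR's stated goal of letting the writer work with either a compressed or an uncompressed stream.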

Usage example

INSERT INTO FILES(
    'path' = 'file:///tmp/export/',
    'format' = 'csv',
    'compression' = 'gzip'
)
SELECT * FROM my_table;

Why I'm doing:

What I'm doing:

Fixes #issue

What type of PR is this:

  • [ ] BugFix
  • [ ] Feature
  • [x] Enhancement
  • [ ] Refactor
  • [ ] UT
  • [ ] Doc
  • [ ] Tool

Does this PR entail a change in behavior?

  • [x] Yes, this PR will result in a change in behavior.
  • [ ] No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • [x] Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • [ ] Parameter changes: default values, similar parameters but with different default values
  • [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
  • [ ] Feature removed
  • [ ] Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • [x] I have added test cases for my bug fix or my new feature
  • [ ] This pr needs user documentation (for new or modified features or behaviors)
    • [ ] I have added documentation for my new feature or new function
  • [ ] This is a backport pr

Bugfix cherry-pick branch check:

  • [ ] I have checked the version labels which the pr will be auto-backported to the target branch
    • [ ] 4.0
    • [ ] 3.5
    • [ ] 3.4
    • [ ] 3.3

[!NOTE] Enable CSV export compression (gzip/snappy/zstd/lz4), auto-append proper file suffixes, and add comprehensive tests.

  • CSV export compression:
    • Add csv::CompressedAsyncOutputStreamFile to compress output using GZIP/SNAPPY/ZSTD/LZ4/LZ4_FRAME.
    • CSVFileWriterFactory selects compressed vs uncompressed stream based on compression_type; CSVFileWriter simplified (drops compression arg) and writes via generic csv::OutputStream.
  • File naming:
    • FileChunkSinkProvider appends compression-specific extensions for CSV (.gz, .snappy, .zst, .lz4).
  • Build/System:
    • Register new sources in be/src/formats/CMakeLists.txt.
  • Tests:
    • Add extensive unit tests for compressed/uncompressed streams and CSV writer (including custom delimiters, large data, all codecs, and invalid codec death test); update test CMake.
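
The file-naming rule summarized above can be sketched as a small lookup. This is an illustration under stated assumptions, not the code in FileChunkSinkProvider: the `Compression` enum, `parse_compression`, and `csv_file_suffix` are hypothetical names, the suffixes other than `.csv.gz` are inferred from the extension list in the summary, and LZ4_FRAME is left out because the summary does not say which extension it receives. Treating an empty string or "none" as uncompressed is also an assumption. The sketch needs C++17 for `std::optional`.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical enum; the real BE code has its own compression type.
enum class Compression { kNone, kGzip, kSnappy, kZstd, kLz4 };

// Parse the user-facing 'compression' property case-insensitively.
// Returns std::nullopt for codecs this sketch does not know about.
std::optional<Compression> parse_compression(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    if (name.empty() || name == "none") return Compression::kNone;  // assumption
    if (name == "gzip") return Compression::kGzip;
    if (name == "snappy") return Compression::kSnappy;
    if (name == "zstd") return Compression::kZstd;
    if (name == "lz4") return Compression::kLz4;
    return std::nullopt;
}

// Map a codec to the suffix of the exported CSV file.
std::string csv_file_suffix(Compression c) {
    switch (c) {
    case Compression::kGzip:
        return ".csv.gz";
    case Compression::kSnappy:
        return ".csv.snappy";
    case Compression::kZstd:
        return ".csv.zst";
    case Compression::kLz4:
        return ".csv.lz4";
    case Compression::kNone:
        break;
    }
    return ".csv";
}
```

Returning `std::nullopt` for an unknown codec leaves the caller free to reject the statement; how the PR itself handles an invalid codec (the summary mentions a death test) is not shown here.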

Written by Cursor Bugbot for commit 51e09305d37089511c75ea530dbd74e944bfc52d. This will update automatically on new commits.

tracymacding Dec 03 '25 09:12

@cursor review

alvin-celerdata Dec 03 '25 17:12

why not https://github.com/StarRocks/starrocks/pull/63327

eshishki Dec 03 '25 18:12

🧪 CI Insights

Here's what we observed from your CI run for 51e09305.

🟢 All jobs passed!

But CI Insights is watching 👀

mergify[bot] Dec 08 '25 07:12

@cursor review

alvin-celerdata Dec 08 '25 17:12

@cursor review

alvin-celerdata Dec 11 '25 15:12

@cursor review

alvin-celerdata Dec 16 '25 17:12

[Java-Extensions Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions[bot] Dec 17 '25 04:12

[FE Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions[bot] Dec 17 '25 04:12

@cursor review

alvin-celerdata Dec 17 '25 05:12

[BE Incremental Coverage Report]

✅ pass : 43 / 53 (81.13%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 be/src/formats/csv/output_stream_file.h | 0 | 1 | 00.00% | [73] |
| 🔵 be/src/connector/file_chunk_sink.cpp | 4 | 13 | 30.77% | [57, 58, 60, 61, 63, 64, 66, 67, 68] |
| 🔵 be/src/formats/csv/output_stream_file.cpp | 30 | 30 | 100.00% | [] |
| 🔵 be/src/formats/csv/csv_file_writer.cpp | 9 | 9 | 100.00% | [] |

github-actions[bot] Dec 17 '25 05:12

> why not #63327

@eshishki, we didn't see you actively responding to comments in that PR, so we have to start a new thread to support this feature.

kevincai Dec 21 '25 12:12