feat: Split output into multiple files

Open bjchambers opened this issue 2 years ago • 3 comments
Summary: Rather than producing a single Parquet file containing the entire result set, we should split the results into multiple files.

There are two reasons: separate partitions should be able to write separate files, and large results should be able to roll over into multiple files, allowing the index columns to be written out and dropped, etc.

The API already supports this, but the Python client (and other places) likely don't have all the plumbing in place.

For an initial pass, it is likely OK to have the Python client download all files and combine them into a single data frame, but this can (and should) evolve over time to allow paging over the files (e.g., fetch the first file and turn that into a data frame) and/or streaming support (fetch files as they become available), etc.
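
To make the two client behaviours above concrete, here is a minimal sketch of "combine everything" versus "page over the files". It is not the actual Python client API: the function names (`iter_pages`, `first_page`, `combine_to_dataframe`) and the idea that the client already has a list of local file paths are assumptions made for illustration. The only library calls used are `pandas.read_parquet` and `pandas.concat`.

```python
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def iter_pages(paths: Iterable[str], load: Callable[[str], T]) -> Iterator[T]:
    """Paging: lazily load one result file at a time.

    `load` is a hypothetical hook, e.g. pandas.read_parquet.
    """
    for path in paths:
        yield load(path)


def first_page(paths: Iterable[str], load: Callable[[str], T]) -> Optional[T]:
    """Load only the first result file, or None if there are no files."""
    return next(iter_pages(paths, load), None)


def combine_to_dataframe(paths: Iterable[str]):
    """Initial-pass behaviour: read every file and concatenate into one frame."""
    import pandas as pd  # imported lazily so paging does not require pandas

    frames = [pd.read_parquet(p) for p in paths]
    # pd.concat raises on an empty list, so return an empty frame instead.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

Because `iter_pages` is a generator, no file is read until the caller asks for it, which is also a natural starting point for streaming if the list of paths is itself produced incrementally.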

  • [ ] Have the Parquet sink rotate files every N rows (~1,000,000 or so)
  • [ ] Verify everything works when producing multiple files
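
The rotation rule in the first task can be sketched as follows. This is only an illustration of "start a new file every N rows", not the sink's actual implementation: the function name `rotate_rows`, the `part-00000.parquet` naming scheme, and the in-memory row lists are all assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

Row = TypeVar("Row")


def rotate_rows(
    rows: Iterable[Row],
    max_rows: int = 1_000_000,
    prefix: str = "part",  # hypothetical file-name prefix
) -> Iterator[Tuple[str, List[Row]]]:
    """Yield (file_name, rows) pairs, each holding at most `max_rows` rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    it = iter(rows)
    index = 0
    while True:
        # Take the next chunk of up to max_rows rows from the stream.
        chunk = list(islice(it, max_rows))
        if not chunk:
            return  # input exhausted; never emit an empty file
        yield f"{prefix}-{index:05d}.parquet", chunk
        index += 1
```

For example, `list(rotate_rows(range(5), max_rows=2))` yields three files holding `[0, 1]`, `[2, 3]`, and `[4]`. A real sink would write each chunk out as it is produced rather than returning it.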

bjchambers avatar Jun 30 '23 16:06 bjchambers

This is likely necessary to make maximal use of partitioned execution.

bjchambers avatar Jun 30 '23 16:06 bjchambers

I wrote up a Python Client proposal design doc here: https://docs.google.com/document/d/1CHTiyLDD52FpwSI-SEhqft9HT1bYB-2WFTrCNrxqC0w/edit?usp=sharing

kevinjnguyen avatar Jul 05 '23 22:07 kevinjnguyen

My latest PR (#495) should write multiple files. @kevinjnguyen once that goes in, would you be able to verify that everything works with the Python client?

bjchambers avatar Jul 10 '23 21:07 bjchambers