kaskada
feat: Split output into multiple files
Summary

Rather than producing a single Parquet file containing the entire result set, we should split the results into multiple files.
There are two reasons: separate partitions should be able to write separate files, and large result sets should be able to roll over into multiple files, allowing index columns to be written out and then dropped, etc.
The API already supports this, but the Python client (and other places) likely don't have all the plumbing in place.
For an initial pass, it is likely OK for the Python client to download all files and combine them into a single DataFrame. Over time this can (and should) evolve to support paging over the files (e.g., fetch the first file and turn it into a DataFrame) and/or streaming (fetch files as they become available), etc.
- [ ] Have the Parquet sink rotate files every N rows (roughly 1,000,000)
- [ ] Verify everything works when producing multiple files
This is likely necessary to make maximal use of partitioned execution.
I wrote up a Python Client proposal design doc here: https://docs.google.com/document/d/1CHTiyLDD52FpwSI-SEhqft9HT1bYB-2WFTrCNrxqC0w/edit?usp=sharing
My latest PR (#495) should write multiple files. @kevinjnguyen once that goes in, would you be able to verify everything is working with the python client support?