polars
Enable writing to cloud storage
Problem description
Currently the object_store crate is (partially) integrated on the read path. Enable it for the write path as well, so that files can be written directly to cloud URLs.
As I started to look into this problem, I found this TODO in the ParquetSink: https://github.com/pola-rs/polars/blob/master/polars/polars-lazy/polars-pipe/src/executors/sinks/parquet_sink.rs#L62
// TODO! speed this up by having a write thread that will make this async
We could change this implementation to always use async, for both local and remote writes. This would bring in tokio as a required dependency. According to crates.io, the crate sizes are:
- tokio 625 kB
- tokio_rustls 27 kB
Async processing could also be used for other IO operations in the planner, so having tokio always available could enable other speedups.
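To make the TODO concrete, here is a rough sketch of the "write thread" idea. It uses a plain std thread and a bounded channel rather than tokio, and it is not taken from the polars code base: `BackgroundWriter`, `send`, and `finish` are hypothetical names I made up for illustration. The sink would hand over already-encoded buffers and continue working while the thread does the blocking writes.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical helper (not part of polars): owns a background thread that
/// performs the blocking writes so the caller can keep producing buffers.
pub struct BackgroundWriter<W: Write + Send + 'static> {
    sender: SyncSender<Vec<u8>>,
    handle: JoinHandle<io::Result<W>>,
}

impl<W: Write + Send + 'static> BackgroundWriter<W> {
    /// `capacity` bounds how many buffers may queue up before `send` blocks,
    /// which gives simple backpressure.
    pub fn new(mut inner: W, capacity: usize) -> Self {
        let (sender, receiver) = sync_channel::<Vec<u8>>(capacity);
        let handle = thread::spawn(move || -> io::Result<W> {
            // The loop ends once the sending side has been dropped.
            for buf in receiver {
                inner.write_all(&buf)?;
            }
            inner.flush()?;
            Ok(inner)
        });
        Self { sender, handle }
    }

    /// Queue one buffer; fails if the writer thread has already stopped
    /// (for example after an IO error).
    pub fn send(&self, buf: Vec<u8>) -> io::Result<()> {
        self.sender
            .send(buf)
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "writer thread stopped"))
    }

    /// Close the channel, wait for the thread, and return the inner writer
    /// (or the first IO error the thread hit).
    pub fn finish(self) -> io::Result<W> {
        let Self { sender, handle } = self;
        drop(sender);
        handle
            .join()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "writer thread panicked"))?
    }
}
```

Usage would be along the lines of `let w = BackgroundWriter::new(file, 4); w.send(bytes)?; let file = w.finish()?;`. An async version on tokio would presumably have the same shape, with a task and an async channel in place of the thread.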
What do you think, @ritchie46?
In case anyone is interested, we added a CloudWriter that wraps ObjectStore, so ParquetWriter, CSVWriter, and friends can write directly to S3 (and other supported storage backends). The code can be found here: https://github.com/elixir-explorer/explorer/pull/653/files#diff-a310e751844a6e1e3a68a299a6d4d6934ae061ce4d797438c084b0ef86b53b1aR1
(Sorry for the duplicate posting, I previously posted this in the "reading from storage" issue.)
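For readers who do not want to open the PR, the general pattern behind such a writer can be sketched as below. This is only an illustration of the idea (an `std::io::Write` adapter in front of a storage client), not the code from the linked PR and not the object_store API: `BlobStore`, `MemoryStore`, and `BufferedCloudWriter` are hypothetical names, and the sketch simply buffers everything in memory and uploads once at the end, which the real implementation may well do differently.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Hypothetical stand-in for a storage client; NOT the `object_store` API.
pub trait BlobStore {
    fn put(&mut self, path: &str, bytes: Vec<u8>) -> io::Result<()>;
}

/// In-memory store, present only so the sketch can run without a cloud account.
#[derive(Default)]
pub struct MemoryStore {
    pub objects: HashMap<String, Vec<u8>>,
}

impl BlobStore for MemoryStore {
    fn put(&mut self, path: &str, bytes: Vec<u8>) -> io::Result<()> {
        self.objects.insert(path.to_string(), bytes);
        Ok(())
    }
}

/// Collects everything written to it and uploads the result in a single
/// `put` when `finish` is called.
pub struct BufferedCloudWriter<S: BlobStore> {
    store: S,
    path: String,
    buffer: Vec<u8>,
}

impl<S: BlobStore> BufferedCloudWriter<S> {
    pub fn new(store: S, path: impl Into<String>) -> Self {
        Self { store, path: path.into(), buffer: Vec::new() }
    }

    /// Upload the buffered bytes and hand the store back to the caller.
    pub fn finish(self) -> io::Result<S> {
        let Self { mut store, path, buffer } = self;
        store.put(&path, buffer)?;
        Ok(store)
    }
}

impl<S: BlobStore> Write for BufferedCloudWriter<S> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        // Nothing is uploaded before `finish`, so flushing is a no-op here.
        Ok(())
    }
}
```

Because the adapter implements `Write`, anything that accepts a generic writer can target it unchanged; the caller only has to remember to call `finish` so the upload actually happens.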