Enable writing to cloud storage

Open winding-lines opened this issue 2 years ago • 1 comments

Problem description

Currently the object_store crate is (partially) integrated on the read path. Enable it on the write path as well, so that files can be written directly to cloud URLs.

winding-lines avatar Jan 11 '23 16:01 winding-lines

As I started to look into this problem I found this TODO in ParquetSink: https://github.com/pola-rs/polars/blob/master/polars/polars-lazy/polars-pipe/src/executors/sinks/parquet_sink.rs#L62

        // TODO! speed this up by having a write thread that will make this async

We could change this implementation to always use async, both for local and remote. This will bring in tokio as a required dependency. According to crates.io the sizes are:

  • tokio 625 kB
  • tokio_rustls 27 kB

Async processing could also be used for other IO operations in the planner, so having tokio always available could enable further speedups.
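
To make the TODO concrete, here is a minimal sketch of the "write thread" idea using only the standard library (a plain OS thread and a bounded channel, not tokio). `ThreadedSink`, `sink`, and `finish` are hypothetical names for illustration and are not part of polars. The producer hands serialized chunks to the channel and only blocks when the channel is full.

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Hypothetical sink (not a polars type): forwards byte chunks to a
/// dedicated writer thread so the caller does not block on IO.
pub struct ThreadedSink<W: Write + Send + 'static> {
    sender: mpsc::SyncSender<Vec<u8>>,
    handle: JoinHandle<io::Result<W>>,
}

impl<W: Write + Send + 'static> ThreadedSink<W> {
    /// `capacity` bounds how many chunks may be queued before `sink` blocks.
    pub fn new(mut writer: W, capacity: usize) -> Self {
        let (sender, receiver) = mpsc::sync_channel::<Vec<u8>>(capacity);
        let handle = thread::spawn(move || -> io::Result<W> {
            // The loop ends once every sender has been dropped.
            for chunk in receiver {
                writer.write_all(&chunk)?;
            }
            writer.flush()?;
            Ok(writer)
        });
        Self { sender, handle }
    }

    /// Queue one chunk; fails if the writer thread has already stopped.
    pub fn sink(&self, chunk: Vec<u8>) -> io::Result<()> {
        self.sender.send(chunk).map_err(|_| {
            io::Error::new(io::ErrorKind::BrokenPipe, "writer thread stopped")
        })
    }

    /// Close the channel, wait for the writer thread, and return the writer
    /// (or the first IO error the thread hit).
    pub fn finish(self) -> io::Result<W> {
        let ThreadedSink { sender, handle } = self;
        drop(sender);
        handle
            .join()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "writer thread panicked"))?
    }
}
```

An async variant would presumably replace the thread with a spawned task and the channel with an async one, which is where the tokio dependency would come in.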

What do you think, @ritchie46?

winding-lines avatar Jan 12 '23 12:01 winding-lines

In case anyone is interested, we added a CloudWriter that wraps ObjectStore, so ParquetWriter, CSVWriter, and friends can write directly to S3 (and other supported storages). The code can be found here: https://github.com/elixir-explorer/explorer/pull/653/files#diff-a310e751844a6e1e3a68a299a6d4d6934ae061ce4d797438c084b0ef86b53b1aR1
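
For readers who want the general shape of such an adapter without opening the PR, here is a rough sketch. It is not the Explorer code and does not use the real object_store API: `BlobStore`, `MemoryStore`, and `BufferedCloudWriter` are hypothetical stand-ins, and the sketch simply buffers everything in memory and uploads once on `finish` (a real writer would more likely stream in parts). The key point is that implementing `std::io::Write` is what lets existing file writers target it.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Hypothetical stand-in for an object store client (NOT the object_store API).
pub trait BlobStore {
    fn put(&mut self, path: &str, bytes: Vec<u8>) -> io::Result<()>;
}

/// In-memory store, used here only so the sketch runs without a network.
#[derive(Default)]
pub struct MemoryStore {
    pub objects: HashMap<String, Vec<u8>>,
}

impl BlobStore for MemoryStore {
    fn put(&mut self, path: &str, bytes: Vec<u8>) -> io::Result<()> {
        self.objects.insert(path.to_string(), bytes);
        Ok(())
    }
}

/// Hypothetical adapter: collects everything written to it and uploads
/// the whole buffer as one object when `finish` is called.
pub struct BufferedCloudWriter<'a, S: BlobStore> {
    store: &'a mut S,
    path: String,
    buffer: Vec<u8>,
}

impl<'a, S: BlobStore> BufferedCloudWriter<'a, S> {
    pub fn new(store: &'a mut S, path: &str) -> Self {
        Self { store, path: path.to_string(), buffer: Vec::new() }
    }

    /// Upload the buffered bytes; nothing reaches the store before this call.
    pub fn finish(self) -> io::Result<()> {
        let Self { store, path, buffer } = self;
        store.put(&path, buffer)
    }
}

impl<S: BlobStore> Write for BufferedCloudWriter<'_, S> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        // Data is only held in memory, so there is nothing to flush yet.
        Ok(())
    }
}
```

Anything that accepts an `impl Write` can be pointed at the adapter, with `finish` called once the format writer is done.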

(Sorry for the duplicate posting, I previously posted this in the "reading from storage" issue.)

josevalim avatar Jul 28 '23 06:07 josevalim