
feat(rust): streaming parquet from object_stores

Open winding-lines opened this issue 2 years ago • 3 comments

Adds support for reading parquet files via the object_store crate. Currently the file and s3 protocols are linked in; gcs and azure can be added if this approach makes sense.

The only supported mode is batched, since remote files are generally large. Setting a large chunk size will download the entire file in one iteration.

This is my first PR in this repo, feedback appreciated.

winding-lines avatar Dec 21 '22 04:12 winding-lines

@winding-lines AFAIK, scan_parquet and read_parquet can already rely on fsspec filesystems to read data directly from S3, GCS and many more. So do we really need a separate read_parquet_cloud for this?

From a library point of view, it would be great if users could seamlessly access their cloud data sets through the existing methods.

chitralverma avatar Dec 21 '22 07:12 chitralverma

I personally believe that Python has achieved its amazing success in machine learning, data science and big data by providing great APIs on top of native libraries. You are right that fsspec is one such API; however, I am concerned that it will hurt performance in some use cases because execution is serialized through Python. Some of these use cases:

  1. I need to open hundreds of parquet files on cloud storage: the recommended parquet file size for Delta Lake is 1 GB, so a 1 TB table has 1000 files.
  2. As the planning engine inside polars-rust gets more sophisticated, it will leverage async downloading from multiple sources. The Rust code should not have to wait for Python to download files on its behalf.

I have unfortunately been in the situation many times where Python code needs to be scaled up through multiprocessing, and I am hoping that polars will let me delay that point in my daily work.

There are also use cases where polars-rust could be used as-is. The delta-rs library uses object_store; that is actually where I got the idea. Doing the downloads on the Rust side will let pure-Rust applications leverage cloud storage.

And the last point: moving enough functionality into Rust will allow polars to be used from other languages, for example Node.

winding-lines avatar Dec 21 '22 12:12 winding-lines

Alright then, the extension to other languages definitely makes sense! I'm just thinking about how it looks as a library for the end users.

@ritchie46 any thoughts?

chitralverma avatar Dec 21 '22 13:12 chitralverma

closes #4865

ozgrakkurt avatar Dec 23 '22 06:12 ozgrakkurt

Thanks for the feedback @ritchie46, addressed most of it. The one exception is around the Mutex: it comes from the futures library, and that library doesn't seem to provide a read-write lock.

winding-lines avatar Dec 24 '22 16:12 winding-lines

Thanks a lot @winding-lines! Great addition.

ritchie46 avatar Dec 26 '22 11:12 ritchie46

@ozgrakkurt can you test to see if this works for you?

winding-lines avatar Dec 26 '22 23:12 winding-lines

@winding-lines @ritchie46 does this allow the following now?

  • consistent globbed paths for scan* and read* methods
  • passing array for paths for scan* and read* methods

chitralverma avatar Dec 27 '22 04:12 chitralverma

This PR only allows one cloud URL. I am happy to push this further, but I don't actually know all the use cases. Could you file an issue describing how the Rust API is supposed to work?

winding-lines avatar Dec 27 '22 06:12 winding-lines

@ozgrakkurt can you test to see if this works for you?

I'll try to test when I get a chance

ozgrakkurt avatar Dec 27 '22 06:12 ozgrakkurt