polars
feat(rust): streaming parquet from object_stores
Supports reading parquet files with the object_store crate. Currently the file and s3 protocols are linked in; gcs and azure can be added if this approach makes sense.
The only supported mode is batched, since remote files are generally large. Setting a large chunk size will download the whole file in one iteration.
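The batched mode described above amounts to iterating over the file in fixed-size chunks rather than loading it all at once. A minimal stdlib-only sketch of that access pattern (illustrative names only, not the polars or object_store API):

```python
import io

def read_batched(f, chunk_size):
    """Yield successive chunks of at most chunk_size bytes from a file-like object."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Stand-in for a remote object; real code would stream from cloud storage.
data = io.BytesIO(b"x" * 10)
chunks = list(read_batched(data, chunk_size=4))
# chunks is [b"xxxx", b"xxxx", b"xx"]; a chunk_size >= the file size
# returns everything in a single iteration, which is the
# "big chunk size downloads the whole file" behaviour noted above.
```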
This is my first PR in this repo, feedback appreciated.
@winding-lines AFAIK, `scan_parquet` and `read_parquet` can rely on `fsspec` filesystems to read data directly from S3, GCS and many more. So do we really need a separate `read_parquet_cloud` for this?
From a library point of view, it would be great if users were able to seamlessly access their cloud data sets from the existing methods.
@winding-lines AFAIK, `scan_parquet` and `read_parquet` can rely on `fsspec` filesystems to read data directly from S3, GCS and many more. So do we really need a separate `read_parquet_cloud` for this? From a library point of view, it would be great if users were able to seamlessly access their cloud data sets from the existing methods.
I personally believe that Python has achieved its amazing success in machine learning, data science and big data by providing great APIs on top of native libraries. You are right that fsspec is one such API; however, I have concerns that it will impact performance in some use cases because of the way execution is serialized through Python. Some of those use cases:
- I need to open hundreds of parquet files on cloud storage: the recommended parquet file size for DeltaLake is 1 GB, so a 1 TB table has 1000 files.
- as the planning engine inside polars-rust gets more sophisticated, it will leverage async downloading from multiple sources. I don't think the Rust code should wait for Python to download files on its behalf.
I have unfortunately been in the situation many times where Python code needs to be scaled up through multiprocessing, and I am hoping that polars will let me delay that point in my daily work.
There are also use cases where polars-rust could be used as-is. The delta-rs library uses object_store; that is actually where I got the idea. Doing the downloads on the Rust side will enable pure-Rust applications to leverage cloud storage.
And the last point is that moving enough functionality into Rust will allow polars to be used from other languages, for example Node.
alright then, the extension to other languages definitely makes sense! I'm just thinking about how it looks as a library for the end users.
@ritchie46 any thoughts?
closes #4865
Thanks for the feedback @ritchie46, I addressed most of it. The one exception is around the Mutex: it comes from the futures library, and that library doesn't seem to have a read/write lock.
Thanks a lot @winding-lines! Great addition.
@ozgrakkurt can you test to see if this works for you?
@winding-lines @ritchie46 does this allow the following now?
- consistent globbed paths for scan* and read* methods
- passing array for paths for scan* and read* methods
This PR only allows one cloud URL. I am happy to push this further, but I don't actually know all the use cases. Could you file an issue describing how the Rust API is supposed to work?
@ozgrakkurt can you test to see if this works for you?
I'll try to test when I get a chance