Polars panics when Azure Storage URI is used
What language are you using?
Rust
Which feature gates did you use?
"polars-io", "parquet", "lazy", "dtype-struct"
Have you tried latest version of polars?
Yes.
What version of polars are you using?
0.22.8
What operating system are you using polars on?
macOS Monterey 12.3.1
What language version are you using
$ rustc --version
rustc 1.64.0-nightly (495b21669 2022-07-03)
$ cargo --version
cargo 1.64.0-nightly (dbff32b27 2022-06-24)
Describe your bug.
Using an Azure Storage URI like abfss://[email protected]/my/adls/folder/my_file.parquet results in a panic with a "No such file or directory" message.
What are the steps to reproduce the behavior?
Given a parquet file located on an Azure Storage account
When the following code is executed:
let df = LazyFrame::scan_parquet(
    "abfss://[email protected]/my/adls/folder/my_file.parquet".to_string(),
    ScanArgsParquet::default(),
)
.unwrap()
.select([all()])
.collect()
.unwrap();
It panics with
thread 'main' panicked at 'called `Result::unwrap()` on
an `Err` value: Io(Os { code: 2, kind: NotFound, message: "No such file or directory" })'
What is the actual behavior?
The error is:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Io(Os { code: 2, kind: NotFound, message: "No such file or directory" })',
src/main.rs:9:10
stack backtrace:
0: 0x110661b24 - std::backtrace_rs::backtrace::libunwind::trace::h515e8409b092ccae
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
1: 0x110661b24 - std::backtrace_rs::backtrace::trace_unsynchronized::haadb9478c42974b3
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x110661b24 - std::sys_common::backtrace::_print_fmt::h582ca8eb4769ca98
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:66:5
3: 0x110661b24 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5739599580de7c03
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:45:22
4: 0x110680bbb - core::fmt::write::hdbd915c356b4a35c
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/fmt/mod.rs:1198:17
5: 0x11065e69c - std::io::Write::write_fmt::hedec9ebe64f68a8c
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/io/mod.rs:1672:15
6: 0x110663237 - std::sys_common::backtrace::_print::hbba57d8ca7ac5872
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:48:5
7: 0x110663237 - std::sys_common::backtrace::print::hf73d56375edacf0a
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:35:9
8: 0x110663237 - std::panicking::default_hook::{{closure}}::h6ea7fabe4546dbbe
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:295:22
9: 0x110662f40 - std::panicking::default_hook::h3b0f5c43f1cc1cb7
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:314:9
10: 0x110663861 - std::panicking::rust_panic_with_hook::h38e58db141a96cd6
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:698:17
11: 0x1106637a3 - std::panicking::begin_panic_handler::{{closure}}::h69d5f77924609c80
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:588:13
12: 0x110661fd7 - std::sys_common::backtrace::__rust_end_short_backtrace::ha6c7f778cf12b0cc
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:138:18
13: 0x11066347d - rust_begin_unwind
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:584:5
14: 0x11073ee03 - core::panicking::panic_fmt::h7d9122ca971122ab
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/panicking.rs:142:14
15: 0x11073ef65 - core::result::unwrap_failed::h08cd9ea7c15b7964
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/result.rs:1805:5
16: 0x10d652612 - core::result::Result<T,E>::unwrap::h77fd2c61b74f7953
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/result.rs:1098:23
17: 0x10d652aef - gyrfalcon::main::h47681dcf9707e8b6
at /.../rust-polars-delta/gyrfalcon/src/main.rs:6:14
18: 0x10d652835 - core::ops::function::FnOnce::call_once::h64fe34f6f736f7dd
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/ops/function.rs:248:5
19: 0x10d652e48 - std::sys_common::backtrace::__rust_begin_short_backtrace::he874fbdf51352f9a
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:122:18
20: 0x10d652eb8 - std::rt::lang_start::{{closure}}::hfba7ca7366433629
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/rt.rs:145:18
21: 0x1106596a7 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h324365c9de115800
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/ops/function.rs:280:13
22: 0x1106596a7 - std::panicking::try::do_call::h984b8436f15c2b0a
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:492:40
23: 0x1106596a7 - std::panicking::try::h4851f4f0c47f2f24
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:456:19
24: 0x1106596a7 - std::panic::catch_unwind::h343fc9d072428d4a
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panic.rs:137:14
25: 0x1106596a7 - std::rt::lang_start_internal::{{closure}}::hadf732d92a9472cb
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/rt.rs:128:48
26: 0x1106596a7 - std::panicking::try::do_call::h82e6b2bfab907999
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:492:40
27: 0x1106596a7 - std::panicking::try::h53c3237109262e5b
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:456:19
28: 0x1106596a7 - std::panic::catch_unwind::h2b03f33cc8b8f4c2
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panic.rs:137:14
29: 0x1106596a7 - std::rt::lang_start_internal::hba2917d8cf49a187
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/rt.rs:128:20
30: 0x10d652e8e - std::rt::lang_start::h67b7db7772698b4e
at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/rt.rs:144:17
31: 0x10d652e26 - _main
What is the expected behavior?
To properly load the parquet file stored on an Azure Storage account.
We only support local file systems. Remote blob storage is not in scope for the project (yet).
@ritchie46 is there a feature request open for this? Polars could benefit a lot from the many use cases enabled by external storage such as S3, Azure Storage, Google Cloud Storage, and other HDFS-compatible stores.
Doesn't Polars support reading from Read + Seek? There is a semantic relationship between Read + Seek and the byte-range headers supported by Azure/S3/GCP.
Doesn't Polars support reading from Read + Seek?
For csv and parquet we use memmap.
There is a semantic relationship between Read + Seek and the byte-range headers supported by Azure/S3/GCP.
Are there already abstractions that map to the appropriate cloud/HTTP calls?
Currently, I don't yet see much benefit from supporting this over a user adding a cloud provider crate and downloading the specific file into memory. We are not yet able to benefit from any pushed-down operations such as projections/predicates/slices.
We first need to have proper support for anonymous readers, and then a way to deal with the pushed-down optimizations so that we save network IO. But given that there are many different providers, I'd want something standardized before building this.
@ritchie46 Do you have any example of how to integrate an external storage library with Polars? Given 1 GB parquet files stored on external storage like S3 or Azure Storage, how can I efficiently fetch only the chunk of data that Polars needs for the query? Downloading the whole 1 GB into memory would be very costly in both time and money.
Download the file once and cache it. We are unable to do anything smart with blob storage. In the Python bindings we use fsspec to fetch cloud files. That gives us an abstraction that looks like a file handle, but there we also just load it into memory, so I would recommend downloading the files.
@ritchie46, wouldn't it be possible to abstract over Read + Seek to offer support for this? Note that this does not need to be async; sync already works. Users can push the IO-bound task to a separate thread pool if they want (by writing a custom struct that implements Read + Seek and makes HTTPS calls from that pool).
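A minimal sketch of such a struct, with hypothetical names: `fetch_range` stands in for the blocking byte-range call a real implementation would make against a cloud API, and an in-memory buffer is used here instead of actual network IO.

```rust
use std::io::{Read, Result, Seek, SeekFrom};

// Hypothetical ranged reader: `fetch_range` stands in for a blocking
// byte-range request (e.g. an HTTP GET with a `Range: bytes=..` header).
struct RangedReader<F: Fn(u64, usize) -> Vec<u8>> {
    fetch_range: F,
    len: u64,
    pos: u64,
}

impl<F: Fn(u64, usize) -> Vec<u8>> Read for RangedReader<F> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        // Clamp the read to the remaining bytes, then fetch exactly one range.
        let remaining = self.len.saturating_sub(self.pos) as usize;
        let n = buf.len().min(remaining);
        let bytes = (self.fetch_range)(self.pos, n);
        buf[..n].copy_from_slice(&bytes);
        self.pos += n as u64;
        Ok(n)
    }
}

impl<F: Fn(u64, usize) -> Vec<u8>> Seek for RangedReader<F> {
    fn seek(&mut self, pos: SeekFrom) -> Result<u64> {
        // Seeking is pure bookkeeping: no bytes move until the next `read`.
        self.pos = match pos {
            SeekFrom::Start(p) => p,
            SeekFrom::End(d) => (self.len as i64 + d) as u64,
            SeekFrom::Current(d) => (self.pos as i64 + d) as u64,
        };
        Ok(self.pos)
    }
}
```

A real backend would replace the closure with a blocking HTTPS call (possibly dispatched to a dedicated thread pool, as suggested above).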
I understand @andrei-ionescu's point that projection and filter pushdown would be advantageous in Rust. We already support a file descriptor in Python (which is basically dyn Read + Seek in Rust), so I think that supporting Read + Seek (even if dyn Read + Seek) could make sense?
But I may be missing something here - isn't the parquet reader code using std::fs::File and passing it to arrow2? Something like
enum Input {
    PathOrGlob(String),
    File(Box<dyn Read + Seek + ...>),
}
and branch it when reading?
But I may be missing something here - isn't the parquet reader code using std::fs::File and passing it to arrow2? Something like
enum Input {
    PathOrGlob(String),
    File(Box<dyn Read + Seek + ...>),
}
We could do this branch indeed. :+1:
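As a sketch of how that branch could look (all names here are hypothetical, not Polars APIs): note that `dyn Read + Seek` is not a legal trait object in Rust, since only one non-auto trait is allowed, so a small helper supertrait with a blanket impl is needed.

```rust
use std::fs::File;
use std::io::{Read, Seek};

// `dyn Read + Seek` is not a valid trait object (Rust allows only one
// non-auto trait), so combine the two behind a supertrait.
trait ReadSeek: Read + Seek {}
impl<T: Read + Seek> ReadSeek for T {}

// Hypothetical input enum along the lines proposed above.
enum Input {
    PathOrGlob(String),
    Reader(Box<dyn ReadSeek>),
}

// Sketch of the branch a reader entry point could take: paths/globs are
// resolved to files, while any user-supplied reader is passed through.
fn open_input(input: Input) -> std::io::Result<Box<dyn ReadSeek>> {
    match input {
        Input::PathOrGlob(path) => Ok(Box::new(File::open(path)?)),
        Input::Reader(r) => Ok(r),
    }
}
```

Anything that already implements Read + Seek, including a cloud-backed reader, then satisfies the `Reader` variant via the blanket impl.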
The thing I am afraid of is implementing this Box<dyn Read + Seek> that represents a cloud provider. I was under the assumption that this is a lot of work, but maybe somebody has some ideas on that.
@ritchie46, @jorgecarleitao: Should we turn this into a feature request then?
@ritchie46, @jorgecarleitao: Should we turn this into a feature request then?
Jup.
@andrei-ionescu do you know if there is a dyn Read + Seek abstraction crate/implementation for Azure Blob Storage? I don't have any experience with Azure.
We have something like that here for S3: https://github.com/DataEngineeringLabs/ranged-reader-rs/blob/main/tests/it/parquet_s3.rs#L6
For Azure we need to find out how to run byte-range queries with its Rust API, i.e. the blocking version of this: https://github.com/Azure/azure-sdk-for-rust/blob/main/sdk/storage_blobs/examples/blob_range.rs#L42
There is also this crate, but I am not sure it supports everything needed: https://docs.rs/object_store/latest/object_store/
The object_store library is currently missing support for Azure Data Lake Storage Gen2, so the abfss protocol won't work.
I created this ticket — https://github.com/apache/arrow-rs/issues/3283 — on the arrow-rs/object_store repo to have support for ADLS Gen2 added.
I'd like to suggest that Polars start by supporting range requests over HTTP. Every cloud vendor allows generating pre-signed GET URLs for objects in cloud storage, which makes HTTP a great escape hatch for supporting every cloud vendor. As far as I can tell, Polars does not currently use range requests when retrieving a slice of a LazyFrame from an HTTP source.
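To sketch the idea (the helper name is mine, not an existing Polars API): the slice a reader needs maps directly onto an RFC 7233 byte-range header sent with a GET to the pre-signed URL.

```rust
// Hypothetical helper: format an HTTP `Range` header value for a byte
// slice given as (offset, length). RFC 7233 byte ranges are inclusive
// on both ends, hence the `- 1`.
fn range_header(offset: u64, len: u64) -> String {
    format!("bytes={}-{}", offset, offset + len - 1)
}
```

A client sending e.g. `Range: bytes=0-1023` against the pre-signed URL receives a `206 Partial Content` response carrying only that slice, regardless of which cloud vendor signed the URL.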
It's not ideal because it requires more boilerplate and knowledge from the end user, but it's not mutually exclusive with later growing built-in support for specific cloud vendors using their SDKs.
I've used this pattern a lot to get around things like boto3 not supporting async functionality in Python (you generate a pre-signed URL, then use aiohttp/httpx to transfer the data).
This should now be supported.