Polars panics when Azure Storage URI is used

Open andrei-ionescu opened this issue 3 years ago • 14 comments

What language are you using?

Rust

Which feature gates did you use?

"polars-io", "parquet", "lazy", "dtype-struct"

Have you tried latest version of polars?

  • [yes]

What version of polars are you using?

0.22.8

What operating system are you using polars on?

macOS Monterey 12.3.1

What language version are you using

$ rustc --version
rustc 1.64.0-nightly (495b21669 2022-07-03)

$ cargo --version
cargo 1.64.0-nightly (dbff32b27 2022-06-24)

Describe your bug.

Using Azure Storage URIs like abfss://[email protected]/my/adls/folder/my_file.parquet results in a panic with a "No such file or directory" message.

What are the steps to reproduce the behavior?

Given a parquet file located on an Azure Storage account

When the following code is executed:

let df = LazyFrame::scan_parquet(
        "abfss://[email protected]/my/adls/folder/my_file.parquet".to_string(), 
        ScanArgsParquet::default())
        .unwrap()
        .select([all()])
        .collect()
        .unwrap();

It panics with

thread 'main' panicked at 'called `Result::unwrap()` on 
an `Err` value: Io(Os { code: 2, kind: NotFound, message: "No such file or directory" })'

What is the actual behavior?

The error is:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Io(Os { code: 2, kind: NotFound, message: "No such file or directory" })', 
src/main.rs:9:10
stack backtrace:
   0:        0x110661b24 - std::backtrace_rs::backtrace::libunwind::trace::h515e8409b092ccae
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   1:        0x110661b24 - std::backtrace_rs::backtrace::trace_unsynchronized::haadb9478c42974b3
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x110661b24 - std::sys_common::backtrace::_print_fmt::h582ca8eb4769ca98
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:66:5
   3:        0x110661b24 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5739599580de7c03
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:45:22
   4:        0x110680bbb - core::fmt::write::hdbd915c356b4a35c
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/fmt/mod.rs:1198:17
   5:        0x11065e69c - std::io::Write::write_fmt::hedec9ebe64f68a8c
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/io/mod.rs:1672:15
   6:        0x110663237 - std::sys_common::backtrace::_print::hbba57d8ca7ac5872
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:48:5
   7:        0x110663237 - std::sys_common::backtrace::print::hf73d56375edacf0a
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:35:9
   8:        0x110663237 - std::panicking::default_hook::{{closure}}::h6ea7fabe4546dbbe
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:295:22
   9:        0x110662f40 - std::panicking::default_hook::h3b0f5c43f1cc1cb7
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:314:9
  10:        0x110663861 - std::panicking::rust_panic_with_hook::h38e58db141a96cd6
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:698:17
  11:        0x1106637a3 - std::panicking::begin_panic_handler::{{closure}}::h69d5f77924609c80
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:588:13
  12:        0x110661fd7 - std::sys_common::backtrace::__rust_end_short_backtrace::ha6c7f778cf12b0cc
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:138:18
  13:        0x11066347d - rust_begin_unwind
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:584:5
  14:        0x11073ee03 - core::panicking::panic_fmt::h7d9122ca971122ab
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/panicking.rs:142:14
  15:        0x11073ef65 - core::result::unwrap_failed::h08cd9ea7c15b7964
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/result.rs:1805:5
  16:        0x10d652612 - core::result::Result<T,E>::unwrap::h77fd2c61b74f7953
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/result.rs:1098:23
  17:        0x10d652aef - gyrfalcon::main::h47681dcf9707e8b6
                               at /.../rust-polars-delta/gyrfalcon/src/main.rs:6:14
  18:        0x10d652835 - core::ops::function::FnOnce::call_once::h64fe34f6f736f7dd
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/ops/function.rs:248:5
  19:        0x10d652e48 - std::sys_common::backtrace::__rust_begin_short_backtrace::he874fbdf51352f9a
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/sys_common/backtrace.rs:122:18
  20:        0x10d652eb8 - std::rt::lang_start::{{closure}}::hfba7ca7366433629
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/rt.rs:145:18
  21:        0x1106596a7 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h324365c9de115800
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/core/src/ops/function.rs:280:13
  22:        0x1106596a7 - std::panicking::try::do_call::h984b8436f15c2b0a
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:492:40
  23:        0x1106596a7 - std::panicking::try::h4851f4f0c47f2f24
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:456:19
  24:        0x1106596a7 - std::panic::catch_unwind::h343fc9d072428d4a
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panic.rs:137:14
  25:        0x1106596a7 - std::rt::lang_start_internal::{{closure}}::hadf732d92a9472cb
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/rt.rs:128:48
  26:        0x1106596a7 - std::panicking::try::do_call::h82e6b2bfab907999
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:492:40
  27:        0x1106596a7 - std::panicking::try::h53c3237109262e5b
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panicking.rs:456:19
  28:        0x1106596a7 - std::panic::catch_unwind::h2b03f33cc8b8f4c2
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/panic.rs:137:14
  29:        0x1106596a7 - std::rt::lang_start_internal::hba2917d8cf49a187
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/rt.rs:128:20
  30:        0x10d652e8e - std::rt::lang_start::h67b7db7772698b4e
                               at /rustc/495b216696ccbc27c73d6bdc486bf4621d610f4b/library/std/src/rt.rs:144:17
  31:        0x10d652e26 - _main

What is the expected behavior?

To properly load the parquet file stored on an Azure Storage account.

andrei-ionescu avatar Jul 05 '22 16:07 andrei-ionescu

We only support local file systems. Remote storage blob is not in scope for the project (yet).
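Since only local paths are supported here, a caller can catch remote URIs before they reach the reader and fail with a clearer message than "No such file or directory". The following is a minimal sketch using only the standard library; `uri_scheme` and `ensure_local` are hypothetical helpers, not part of Polars, and the URI in `main` uses made-up container and account names.

```rust
/// Returns the scheme (e.g. "abfss") if `path` looks like `scheme://...`.
/// Hypothetical helper, not a Polars API.
fn uri_scheme(path: &str) -> Option<&str> {
    let (scheme, _rest) = path.split_once("://")?;
    let mut chars = scheme.chars();
    // RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    let first_ok = chars.next().map_or(false, |c| c.is_ascii_alphabetic());
    let rest_ok = chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'));
    if first_ok && rest_ok {
        Some(scheme)
    } else {
        None
    }
}

/// Accepts plain local paths and rejects anything that carries a URI scheme.
fn ensure_local(path: &str) -> Result<&str, String> {
    match uri_scheme(path) {
        None => Ok(path),
        Some(scheme) => Err(format!(
            "unsupported scheme `{scheme}`: only local file paths can be scanned"
        )),
    }
}

fn main() {
    // Made-up container and account names, for illustration only.
    let uri = "abfss://container@account.dfs.core.windows.net/folder/file.parquet";
    match ensure_local(uri) {
        Ok(path) => println!("would scan {path}"),
        Err(msg) => eprintln!("{msg}"),
    }
}
```

Returning an error here, rather than calling `unwrap()` later, keeps the failure readable and avoids the panic shown in the report.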

ritchie46 avatar Jul 05 '22 18:07 ritchie46

@ritchie46 is there a feature request opened for this? Polars can benefit a lot from the multiple use cases with external storages like S3, Azure Storage, Google Cloud Storage and other HDFS compatible storages.

andrei-ionescu avatar Jul 06 '22 13:07 andrei-ionescu

Doesn't Polars support reading from Read + Seek? There is a semantic relationship between Read + Seek and byte-range headers supported by azure/s3/gcp.
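To make that relationship concrete, here is a rough sketch of an adapter that turns a "fetch this byte range" callback into `Read + Seek`. `RangedReader` and its `fetch` callback are hypothetical names chosen for illustration; in `main` the callback slices an in-memory buffer, where a real backend would issue one ranged GET per call. The total object length is assumed to be known up front (for example from a HEAD request).

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical adapter: `fetch(start, len)` returns up to `len` bytes starting at `start`.
struct RangedReader<F> {
    pos: u64,
    len: u64,
    fetch: F,
}

impl<F> RangedReader<F>
where
    F: FnMut(u64, usize) -> io::Result<Vec<u8>>,
{
    fn new(len: u64, fetch: F) -> Self {
        RangedReader { pos: 0, len, fetch }
    }
}

impl<F> Read for RangedReader<F>
where
    F: FnMut(u64, usize) -> io::Result<Vec<u8>>,
{
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Never ask for bytes past the end of the object.
        let remaining = self.len.saturating_sub(self.pos);
        let want = (buf.len() as u64).min(remaining) as usize;
        if want == 0 {
            return Ok(0);
        }
        // A remote backend would send `Range: bytes=pos-(pos+want-1)` here.
        let bytes = (self.fetch)(self.pos, want)?;
        let n = bytes.len().min(want);
        buf[..n].copy_from_slice(&bytes[..n]);
        self.pos += n as u64;
        Ok(n)
    }
}

impl<F> Seek for RangedReader<F> {
    fn seek(&mut self, to: SeekFrom) -> io::Result<u64> {
        // Seeking is pure bookkeeping: no request is made until the next read.
        let target: i128 = match to {
            SeekFrom::Start(n) => n as i128,
            SeekFrom::End(d) => self.len as i128 + d as i128,
            SeekFrom::Current(d) => self.pos as i128 + d as i128,
        };
        if target < 0 || target > u64::MAX as i128 {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "seek out of range"));
        }
        self.pos = target as u64;
        Ok(self.pos)
    }
}

fn main() -> io::Result<()> {
    let blob: Vec<u8> = (0u8..100).collect(); // stands in for a remote object
    let mut reader = RangedReader::new(
        blob.len() as u64,
        |start: u64, len: usize| -> io::Result<Vec<u8>> {
            let s = start as usize;
            Ok(blob[s..s + len].to_vec())
        },
    );
    reader.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    reader.read_exact(&mut tail)?;
    println!("{:?}", tail);
    Ok(())
}
```

Each `read` becomes one fetch, so a production version would probably want buffering to avoid many tiny requests.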

jorgecarleitao avatar Jul 06 '22 13:07 jorgecarleitao

> Doesn't Polars support reading from Read + Seek?

For csv and parquet we use memmap.

> There is a semantic relationship between Read + Seek and byte-range headers supported by azure/s3/gcp.

Are there already abstractions that map to the appropriate cloud/HTTP calls?

Currently, I don't see much benefit in supporting this versus a user adding a cloud provider crate and downloading the specific file into memory. We are not yet able to benefit from any pushed-down operations such as projections/predicates/slices.

We first need to have proper support for anonymous readers and then a way to deal with the pushed down optimizations such that we save network IO. But given that there are many different providers I'd want something standardized before building this.

ritchie46 avatar Jul 06 '22 13:07 ritchie46

@ritchie46 Do you have an example of how to integrate an external storage library with Polars? Given 1 GB parquet files stored on external storage like S3 or Azure Storage, how can I efficiently fetch only the chunk of data that Polars needs for the query? Downloading the whole 1 GB and holding it in memory would be very costly in both time and money.

andrei-ionescu avatar Jul 06 '22 14:07 andrei-ionescu

Download the file once and cache it. We are unable to do anything smart with blob storage. In the Python bindings we use fsspec to get cloud files. That gives us an abstraction that looks like a file handle, but there we also just load it into memory, so I would recommend downloading the files.
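A download-once-and-cache step can be sketched with the standard library alone. Everything here is an assumption made for illustration: `cached_file` is a hypothetical helper, and the `download` closure stands in for whatever cloud SDK call fetches the whole object.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns a local path for `key`, calling `download` only when no cached copy exists.
/// Hypothetical helper; `download` is a placeholder for a real cloud SDK call.
fn cached_file<F>(cache_dir: &Path, key: &str, download: F) -> io::Result<PathBuf>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    // Flatten the object key into one file name. Distinct keys can collide
    // after flattening (a simplification; a hash of the key would avoid that).
    let flat: String = key
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '.' { c } else { '_' })
        .collect();
    let path = cache_dir.join(format!("blob_{}", flat));

    if !path.exists() {
        fs::create_dir_all(cache_dir)?;
        // Write under a temporary name first so an interrupted download
        // is never mistaken for a complete cached file.
        let tmp = path.with_extension("partial");
        fs::write(&tmp, download()?)?;
        fs::rename(&tmp, &path)?;
    }
    Ok(path)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("blob_cache_demo");
    // The closure returns fake bytes; a real one would fetch the object.
    let path = cached_file(&dir, "my/adls/folder/my_file.parquet", || {
        Ok(b"not a real parquet file".to_vec())
    })?;
    println!("cached at {}", path.display());
    Ok(())
}
```

The returned path is an ordinary local file, so it can be handed to `LazyFrame::scan_parquet` the same way as any other path.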

ritchie46 avatar Jul 06 '22 14:07 ritchie46

@ritchie46, wouldn't it be possible to abstract over Read + Seek to offer support for it? Note that this does not need to be async - sync works already - users can push the IO-bound task to a separate thread pool if they want (by writing a custom struct that implements Read + Seek and makes HTTPS calls from that pool).

I understand @andrei-ionescu's point that projection and filter pushdown would be advantageous in Rust. We already support a file descriptor in Python (which is basically dyn Read + Seek in Rust), so I think that supporting Read + Seek (even if dyn Read + Seek) could make sense?

But I may be missing something here - isn't the parquet reader code using std::fs::File and passing it to arrow2? Something like

enum Input {
    PathOrGlob(String),
    File(Box<dyn Read + Seek + ...>),
}

and branch it when reading?
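For reference, a compilable variant of that idea might look like the sketch below. `Box<dyn Read + Seek>` is not accepted by the compiler as written, because a trait object can name only one non-auto trait, so the sketch combines the two behind a helper supertrait. `ReadSeek` and `into_reader` are hypothetical names, glob expansion is left out, and the `...` bounds from the snippet above (presumably something like `Send`) are omitted.

```rust
use std::fs::File;
use std::io::{self, Cursor, Read, Seek};

/// Helper supertrait so that `Read + Seek` can be used as one trait object.
trait ReadSeek: Read + Seek {}
impl<T: Read + Seek> ReadSeek for T {}

enum Input {
    PathOrGlob(String),
    File(Box<dyn ReadSeek>),
}

impl Input {
    /// Branch once, then hand a uniform reader to the decoding code.
    fn into_reader(self) -> io::Result<Box<dyn ReadSeek>> {
        match self {
            // Glob expansion is omitted: the string is treated as a single path.
            Input::PathOrGlob(path) => Ok(Box::new(File::open(path)?)),
            Input::File(reader) => Ok(reader),
        }
    }
}

fn main() -> io::Result<()> {
    // Any Read + Seek source works; a Cursor stands in for a remote reader.
    let input = Input::File(Box::new(Cursor::new(b"PAR1 bytes from anywhere".to_vec())));
    let mut reader = input.into_reader()?;
    let mut magic = [0u8; 4];
    reader.read_exact(&mut magic)?;
    println!("{}", String::from_utf8_lossy(&magic));
    Ok(())
}
```

The dynamic dispatch costs one virtual call per read, which is likely negligible next to the IO itself.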

jorgecarleitao avatar Jul 06 '22 14:07 jorgecarleitao

> But I may be missing something here - isn't the parquet reader code using std::fs::File and passing it to arrow2? Something like
>
> enum Input {
>     PathOrGlob(String),
>     File(Box<dyn Read + Seek + ...>),
> }

We could do this branch indeed. :+1:

What I am afraid of is implementing the Box<dyn Read + Seek> that represents a cloud provider. I assumed this would be a lot of work, but maybe somebody has some ideas on that.

ritchie46 avatar Jul 06 '22 15:07 ritchie46

@ritchie46, @jorgecarleitao: Should we turn this into a feature request then?

andrei-ionescu avatar Jul 07 '22 10:07 andrei-ionescu

> @ritchie46, @jorgecarleitao: Should we turn this into a feature request then?

Jup.

@andrei-ionescu do you know if there is a dyn Read + Seek abstraction crate/implementation for Azure Blob Storage? I don't have any experience with Azure.

ritchie46 avatar Jul 07 '22 10:07 ritchie46

We have something like that here for S3: https://github.com/DataEngineeringLabs/ranged-reader-rs/blob/main/tests/it/parquet_s3.rs#L6

For Azure we need to find out how to run byte-range queries with its Rust API, i.e. the blocking version of this: https://github.com/Azure/azure-sdk-for-rust/blob/main/sdk/storage_blobs/examples/blob_range.rs#L42

jorgecarleitao avatar Jul 07 '22 14:07 jorgecarleitao

You also have this crate, but I am not sure it supports everything needed: https://docs.rs/object_store/latest/object_store/

ghuls avatar Jul 07 '22 15:07 ghuls

The Object Store library is currently missing support for Azure Data Lake Storage Gen2, so the abfss protocol won't work.

I created a ticket (https://github.com/apache/arrow-rs/issues/3283) on the arrow-rs repo for the object_store crate to have ADLS Gen2 support added.

andrei-ionescu avatar Dec 06 '22 17:12 andrei-ionescu

I'd like to suggest that Polars start by supporting range requests for HTTP requests. Every cloud vendor allows generating pre-signed GET URLs for objects in cloud storage which makes HTTP a great escape hatch to support every cloud vendor. As far as I can tell Polars does not currently use range requests when retrieving a slice of a LazyFrame from an HTTP source.
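As a rough illustration of what range requests buy for Parquet over plain HTTP: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can fetch the last 8 bytes, learn the footer size, and then request only the footer. The helper names below are hypothetical, and the sketch only builds header values and parses bytes. It makes no claim about how Polars issues requests.

```rust
/// Formats an HTTP `Range` header value for `len` bytes starting at `offset`.
/// Byte ranges are inclusive on both ends, hence the `len - 1`.
fn range_header(offset: u64, len: u64) -> Option<String> {
    if len == 0 {
        return None; // an empty range cannot be expressed
    }
    let last = offset.checked_add(len - 1)?;
    Some(format!("bytes={}-{}", offset, last))
}

/// Suffix range: the final `n` bytes of the object, without knowing its size.
fn suffix_range_header(n: u64) -> String {
    format!("bytes=-{}", n)
}

/// Parses the last 8 bytes of a Parquet file: footer length (LE u32) + "PAR1".
fn footer_len(tail: &[u8; 8]) -> Option<u32> {
    if tail[4..] != b"PAR1"[..] {
        return None;
    }
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

fn main() {
    // Request 1: the 8-byte tail.
    println!("Range: {}", suffix_range_header(8));

    // Pretend the server answered with a 120-byte footer for a 1000-byte file.
    let tail = [120, 0, 0, 0, b'P', b'A', b'R', b'1'];
    let file_size: u64 = 1000;
    if let Some(len) = footer_len(&tail) {
        // Request 2: only the footer, which sits just before the 8-byte tail.
        let start = file_size - 8 - len as u64;
        if let Some(range) = range_header(start, len as u64) {
            println!("Range: {}", range);
        }
    }
}
```

With the footer metadata in hand, the same `range_header` helper can address individual column chunks, which is where projection pushdown would start saving network IO.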

It's not ideal because it requires more boilerplate and knowledge for the end user but it's not mutually exclusive with later growing built-in support for specific cloud vendors using their specific SDKs.

I've used this pattern a lot to get around stuff like boto3 not supporting async functionality in Python (you generate a presigned URL then use aiohttp/httpx to transfer the data).

adriangb avatar Jan 29 '23 22:01 adriangb

This should now be supported.

stinodego avatar Feb 17 '24 23:02 stinodego