
scan_ds with S3 Dataset Errors on Curl Function

Open StuartHadfield opened this issue 2 years ago • 5 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Trying to perform operations on a pyarrow dataset stored on S3 returns an obscure CURL error. I am able to interact with the dataset using vanilla pyarrow.

Reproducible example

Constructing a dataset as follows:

import polars as pl
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

filesystem = S3FileSystem(
  secret_key='my-aws-secret-key',
  access_key='my-aws-access-key',
)

my_dataset = ds.dataset('/some/s3/dir', filesystem=filesystem)
lf = pl.scan_ds(my_dataset)

lf.head().collect()

Returns

*** exceptions.ComputeError: PyErr { type: <class 'OSError'>, value: OSError("When reading information for key 'path/to/parquet/file.parquet' in bucket 'my-bucket': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A lib curl function was given a bad argument") traceback: Some(<traceback object at 0x...>) }

Expected behavior

I'd expect this to return a DataFrame with the first 5 rows, not a CURL error. I'm aware this could be some obscure S3 interaction going awry, but I've no clue how to debug it - figured asking was worth a shot! Thanks for all your work on Polars!

Installed versions

---Version info---
Polars: 0.16.9
Index type: UInt32
Platform: Linux-4.18.0-372.19.1.el8_6.x86_64-x86_64-with-glibc2.2.5
Python: 3.8.12 (default, Apr 21 2022, 07:55:08)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-10)]
---Optional dependencies---
pyarrow: 11.0.0
pandas: 1.5.3
numpy: 1.24.2

StuartHadfield avatar Mar 01 '23 17:03 StuartHadfield

Not sure if relevant - but this does execute for ages before giving me any feedback. I had anticipated that taking a head would be super fast.

StuartHadfield avatar Mar 01 '23 17:03 StuartHadfield

Not sure if this will help or not - but:

The curl mention suggests the error is coming from pyarrow.

It looks like there is a functools.partial call followed by pickling when you use .scan_ds():

https://github.com/pola-rs/polars/blob/master/py-polars/polars/internals/anonymous_scan.py#L80-L81
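The pattern above can be sketched roughly as follows (this is an illustrative stand-in, not the actual polars source): the key point is that whatever the partial captures - including a filesystem-backed dataset - must survive a pickle round trip.

```python
import functools
import pickle

def _scan(ds_handle, with_columns):
    # placeholder for the real pyarrow-dataset -> polars conversion
    return (ds_handle, with_columns)

# bind the dataset handle, then pickle the bound callable, as the
# anonymous scan machinery appears to do
bound = functools.partial(_scan, "dataset-handle")
restored = pickle.loads(pickle.dumps(bound))

assert restored(["foo"]) == ("dataset-handle", ["foo"])
```

If the pickled state (e.g. an S3FileSystem's credentials or connection state) doesn't round-trip cleanly, that could plausibly surface later as a low-level error like the curl one above.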

Perhaps you could try calling pl.from_arrow(my_dataset.to_table(... directly to see if that succeeds:

https://github.com/pola-rs/polars/blob/master/py-polars/polars/internals/anonymous_scan.py#L57

That may help isolate where the problem is.

cmdlineluser avatar Mar 01 '23 17:03 cmdlineluser

OK, so I can't load the entire dataset because my process gets OOM-killed, but loading just a few columns is fine.

pl.from_arrow(my_dataset.to_table(columns=['foo', 'bar']))
> shape: (...)

I wonder if the size of the data is an issue here?

StuartHadfield avatar Mar 01 '23 20:03 StuartHadfield

A bit more debugging info:

  • I'm able to load the entire dataset into an in-memory table with just pyarrow (i.e. my_dataset.to_table() works).
  • Copying the dataset down (about 3.6 GB, GZIP-compressed) and loading it from my local dir (to eliminate the CURL <> S3 component) OOMs the process (32 GB memory, 10 GB swap).

I'm curious as to why Polars is OOMing where PyArrow handles the dataset size fine... :/

StuartHadfield avatar Mar 02 '23 09:03 StuartHadfield

Searching the tracker for "OOM" brings up a recently filed issue which seems to be related:

https://github.com/pola-rs/polars/issues/7049

I'm not sure what the next step in debugging this is; it's going to need the help of the Polars team.

cmdlineluser avatar Mar 02 '23 12:03 cmdlineluser