scan_ds with S3 Dataset Errors on Curl Function
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Trying to perform operations on a pyarrow dataset stored on S3 returns an obscure CURL error. I am able to interact with the dataset using vanilla pyarrow.
Reproducible example
Constructing a dataset as follows:
import polars as pl
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

filesystem = S3FileSystem(
    secret_key='my-aws-secret-key',
    access_key='my-aws-access-key',
)
my_dataset = ds.dataset('/some/s3/dir', filesystem=filesystem)
lf = pl.scan_ds(my_dataset)
lf.head().collect()
Returns
*** exceptions.ComputeError: PyErr { type: <class 'OSError'>, value: OSError("When reading information for key 'path/to/parquet/file.parquet' in bucket 'my-bucket': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A lib curl function was given a bad argument") traceback: Some(<traceback object at 0x...>) }
Expected behavior
I'd expect this to return a DataFrame with the first 5 rows, not a CURL error. I'm aware this could be some obscure S3 interaction going awry, but I've no clue how to debug it - figured asking was worth a shot! Thanks for all your work on Polars!
Installed versions
---Version info---
Polars: 0.16.9
Index type: UInt32
Platform: Linux-4.18.0-372.19.1.el8_6.x86_64-x86_64-with-glibc2.2.5
Python: 3.8.12 (default, Apr 21 2022, 07:55:08)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-10)]
---Optional dependencies---
pyarrow: 11.0.0
pandas: 1.5.3
numpy: 1.24.2
Not sure if it's relevant, but this executes for a long time before giving me any feedback. I had anticipated that taking a head would be super fast.
Not sure if this will help or not, but:
The curl mention suggests that the error is coming from pyarrow.
It looks like there is a partial call and then a pickling step when you use .scan_ds():
https://github.com/pola-rs/polars/blob/master/py-polars/polars/internals/anonymous_scan.py#L80-L81
Perhaps you could try using pl.from_arrow(ds.to_table(... directly to see if that succeeds:
https://github.com/pola-rs/polars/blob/master/py-polars/polars/internals/anonymous_scan.py#L57
That may help isolate where the problem is.
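For what it's worth, here is a minimal sketch of that isolation check. It assumes the my_dataset object from the reproducible example above and only uses standard pyarrow Dataset methods (head / to_table) plus pl.from_arrow, not anything scan_ds-specific:

import polars as pl

# Bypass scan_ds and go through pyarrow + from_arrow by hand.
# `my_dataset` is the pyarrow dataset built in the reproducible example above.
small = my_dataset.head(5)        # pull only a handful of rows via pyarrow
print(pl.from_arrow(small))       # convert the Arrow table to a Polars DataFrame

# If that works, try materializing the full table, which is roughly what scan_ds does:
print(pl.from_arrow(my_dataset.to_table()).head())

If the first call succeeds but the second one fails, the problem is in pulling the full dataset rather than in the Arrow-to-Polars conversion itself.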
OK, so I can't load the entire dataset because my process gets OOM-killed, but loading just a few columns is okay.
pl.from_arrow(my_dataset.to_table(columns=['foo', 'bar']))
> shape: (...)
I wonder if the size of the data is an issue here?
A bit more debugging info:
- I'm able to load the entire dataset into an in-memory table with just pyarrow (i.e. my_dataset.to_table() works).
- Copying the dataset down (about 3.6 GB, GZIP compressed) and loading it from my local dir (to eliminate the CURL <> S3 component) OOMs the process (32 GB memory, 10 GB swap).
I'm curious as to why Polars is OOMing where PyArrow handles the dataset size kinda fine... :/
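One way to narrow this down (a sketch only; it again assumes the my_dataset object from the example above and uses pyarrow's Dataset.to_batches) is to convert one record batch at a time, so only a single batch is resident in memory during the conversion:

import polars as pl

# Stream record batches from the dataset and convert each one separately.
# `my_dataset` is the pyarrow dataset from the reproducible example above.
n_rows = 0
for batch in my_dataset.to_batches():
    n_rows += pl.from_arrow(batch).height   # convert one batch to a Polars DataFrame
print(n_rows)

If this also blows up, the extra memory is going into the per-batch Arrow-to-Polars conversion; if it runs fine, the OOM only shows up when the whole table is converted at once.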
Searching the tracker for "OOM" brings up a recently filed issue which seems to be related:
https://github.com/pola-rs/polars/issues/7049
I'm not sure what the next step in debugging this is; it's going to need the help of the Polars team.