Testing with scan_parquet doesn't work anymore from within `io/cloud/test_aws.py`

Open svaningelgem opened this issue 2 years ago • 3 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Just re-add (pl.scan_parquet, "parquet"), to the parameters of test_scan_s3.

(removed by @ritchie46 in PR #11210)

Log output

exceptions.ComputeError: Generic S3 error: response error "request error", after 0 retries: builder error for url (http://127.0.0.1:5000/bucket/foods1.parquet): URL scheme is not allowed

Issue description

The call fails, I believe because the object_store crate does not allow plain http by default. So I added the following (according to the object_store docs):

    # monkeypatch_module.setenv("AWS_ENDPOINT", f"http://{host}:{port}")
    monkeypatch_module.setenv("AWS_ALLOW_HTTP", "true")

to the s3_base fixture (same file). I tried with the AWS_ENDPOINT line both enabled and disabled.

But this just made the test hang (deadlock?), i.e.:

INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.csv HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.ipc HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.parquet HTTP/1.1" 200 -
Terminated

The "Terminated" line is there because I killed the process myself after a minute or so.

This is fairly similar to #11372, but I created this new thread because it focuses purely on the testing.

Expected behavior

I would expect scan_parquet to return a LazyFrame.

Installed versions

(main branch)
--------Version info---------
Polars:              0.19.7
Index type:          UInt32
Platform:            Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_sqlite:  0.7.0
cloudpickle:         2.2.1
connectorx:          0.3.2
deltalake:           0.10.1
fsspec:              2023.9.2
gevent:              23.9.1
matplotlib:          3.8.0
numpy:               1.26.0
openpyxl:            3.1.2
pandas:              2.1.1
pyarrow:             13.0.0
pydantic:            2.4.2
pyiceberg:           0.5.0
pyxlsb:              1.0.10
sqlalchemy:          2.0.21
xlsx2csv:            0.8.1
xlsxwriter:          3.1.6

svaningelgem, Oct 05 '23 08:10

This is because object_store tries to connect to AWS. It has more to do with making this work with moto testing than with an actual bug in the AWS connection code.

ritchie46, Oct 06 '23 09:10

Indeed, but if it's not tested, how can we (read: I) improve on it? 😁

I'm trying to make sink_parquet work with the object_store code (ticket #11056), but if I can't test it... I can't fix it. And I don't know Rust that well (better now that I'm digging into it, but still)... So if it's not too much of an issue:

  • Could you describe what is needed to make it work?
  • Or if it's faster: fix the tests?

Thanks

svaningelgem, Oct 06 '23 09:10

@svaningelgem I observed the same issue while trying to use a ThreadedMotoServer. You can get this to work if you launch moto_server as a subprocess instead. I am currently using this as a workaround for polars + S3 testing in Python.

TylerGrantSmith, Dec 05 '23 17:12
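
For anyone landing here, the subprocess workaround described above could be sketched roughly as follows. This is not code from the thread or from the Polars test suite: the helper names (`wait_for_port`, `s3_test_env`, `moto_server_subprocess`) and the dummy credentials are made up for illustration, and the `moto_server -H ... -p ...` flags are assumed to match the installed moto version. The environment variable names `AWS_ENDPOINT` and `AWS_ALLOW_HTTP` are the ones tried in the issue description.

```python
import socket
import subprocess
import time
from contextlib import contextmanager


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until something accepts TCP connections on host:port, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False


def s3_test_env(host: str, port: int) -> dict:
    """Environment for pointing the S3 client at a local mock (hypothetical helper)."""
    return {
        # Dummy credentials (assumption): the mock server does not validate them.
        "AWS_ACCESS_KEY_ID": "testing",
        "AWS_SECRET_ACCESS_KEY": "testing",
        # Variable names taken from the issue description.
        "AWS_ENDPOINT": f"http://{host}:{port}",
        "AWS_ALLOW_HTTP": "true",
    }


@contextmanager
def moto_server_subprocess(host: str = "127.0.0.1", port: int = 5000):
    """Run moto_server in its own process instead of a thread in the test process."""
    # Assumes the `moto_server` executable is on PATH (installed with moto's server extra).
    proc = subprocess.Popen(["moto_server", "-H", host, "-p", str(port)])
    try:
        if not wait_for_port(host, port):
            raise RuntimeError("moto_server did not start listening in time")
        yield f"http://{host}:{port}"
    finally:
        # Always stop the server, even if the test body raised.
        proc.terminate()
        proc.wait(timeout=10)
```

In a pytest setup, a module-scoped fixture could enter `moto_server_subprocess()`, apply `s3_test_env(...)` via `monkeypatch.setenv`, and upload the test files before yielding. Whether this avoids the hang for a given setup would still need to be checked locally.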