polars
Testing with scan_parquet doesn't work anymore from within `io/cloud/test_aws.py`
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Just re-add `(pl.scan_parquet, "parquet")` to the parameters of `test_scan_s3` (removed by @ritchie46 in PR #11210).
Log output
exceptions.ComputeError: Generic S3 error: response error "request error", after 0 retries: builder error for url (http://127.0.0.1:5000/bucket/foods1.parquet): URL scheme is not allowed
Issue description
The call fails, I believe because the object_store crate rejects plain-http URLs by default.
So, following the object_store docs, I added

```python
# monkeypatch_module.setenv("AWS_ENDPOINT", f"http://{host}:{port}")
monkeypatch_module.setenv("AWS_ALLOW_HTTP", "true")
```

to the `s3_base` fixture in the same file (I tried with the endpoint line both enabled and disabled).
But this just hung (deadlocked?) the test, i.e.:
INFO werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket HTTP/1.1" 200 -
INFO werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.csv HTTP/1.1" 200 -
INFO werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.ipc HTTP/1.1" 200 -
INFO werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.parquet HTTP/1.1" 200 -
Terminated
The `Terminated` line is there because I killed the process myself after a minute or so.
This is fairly similar to #11372, but I created this new issue because here I focus purely on the testing.
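For completeness, here is a hedged sketch of passing the same settings per call via `storage_options` instead of environment variables. The key names are assumptions taken from object_store's `AmazonS3Builder` configuration keys, and I haven't verified that polars forwards them on this code path:

```python
# Hedged sketch (not the actual test fixture): object_store-style
# configuration passed per call instead of via environment variables.
# The exact key names are assumptions based on object_store's
# AmazonS3Builder configuration keys.
storage_options = {
    "aws_access_key_id": "testing",
    "aws_secret_access_key": "testing",
    "aws_region": "us-east-1",
    "aws_endpoint": "http://127.0.0.1:5000",  # local moto endpoint
    "aws_allow_http": "true",  # permit the plain-http scheme
}

# Would then be used roughly as (assuming polars forwards these keys):
# lf = pl.scan_parquet("s3://bucket/foods1.parquet", storage_options=storage_options)
```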
Expected behavior
I would expect the `scan_parquet` call to return a LazyFrame.
Installed versions
(main branch)
--------Version info---------
Polars: 0.19.7
Index type: UInt32
Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_sqlite: 0.7.0
cloudpickle: 2.2.1
connectorx: 0.3.2
deltalake: 0.10.1
fsspec: 2023.9.2
gevent: 23.9.1
matplotlib: 3.8.0
numpy: 1.26.0
openpyxl: 3.1.2
pandas: 2.1.1
pyarrow: 13.0.0
pydantic: 2.4.2
pyiceberg: 0.5.0
pyxlsb: 1.0.10
sqlalchemy: 2.0.21
xlsx2csv: 0.8.1
xlsxwriter: 3.1.6
It is because object_store tries to connect to AWS. This has more to do with making this work with moto testing than with an actual bug in the AWS connection code.
Indeed, but if it's not tested, how can we (read: I) improve on it? 😁
I'm trying to make `sink_parquet` work with the object_store code (ticket #11056), but if I can't test it, I can't fix it. And I don't know Rust that well (better now that I'm digging into it, but still)... So if it's not too much of an issue:
- Could you describe what is needed to make it work?
- Or, if it's faster: fix the tests?
Thanks
@svaningelgem I observed the same issue while trying to use a ThreadedMotoServer. Instead, you can get this to work if you launch moto_server as a subprocess. I am currently using this as a workaround for polars + s3 testing in python.
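A minimal sketch of that subprocess workaround, assuming `moto[server]` is installed and exposes the `moto.server` module entry point (the helper names here are illustrative, not the actual test code):

```python
# Hedged sketch: launch moto_server in a separate process instead of a
# ThreadedMotoServer, so object_store's blocking HTTP calls cannot
# deadlock against a server thread in the same process.
import socket
import subprocess
import sys
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until the server accepts TCP connections, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False


def start_moto_server(host: str = "127.0.0.1", port: int = 5000) -> subprocess.Popen:
    """Start moto_server as a subprocess (assumes `pip install moto[server]`)."""
    proc = subprocess.Popen(
        [sys.executable, "-m", "moto.server", "-H", host, "-p", str(port)]
    )
    if not wait_for_port(host, port):
        proc.terminate()
        raise RuntimeError("moto server did not start in time")
    return proc
```

A pytest fixture would then call `start_moto_server()` in setup and `proc.terminate()` in teardown; the key design choice is that the mock S3 server lives in its own process rather than a thread.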