bug: Unable to connect to Athena
What happened?
I followed the instructions at https://ibis-project.org/backends/athena to connect to and use the Amazon Athena backend as follows:
```python
import os

import ibis
from dotenv import load_dotenv

load_dotenv()  # load env variables from .env

# the staging dir env variable below is of the format s3://MY_BUCKET
con = ibis.athena.connect(s3_staging_dir=os.environ['AWS_ATHENA_S3_STAGING_DIR'])
```
The call to `ibis.athena.connect` fails with the following error:
```
Failed to read MY_BUCKET/3d1008dc-55de-490e-b9f9-cd581b6f327f.csv.
Traceback (most recent call last):
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 240, in _read_csv
    self._fs.open_input_stream(f"{bucket}/{key}"),
  File "pyarrow/_fs.pyx", line 829, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When reading information for key '3d1008dc-55de-490e-b9f9-cd581b6f327f.csv' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
Traceback (most recent call last):
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 240, in _read_csv
    self._fs.open_input_stream(f"{bucket}/{key}"),
  File "pyarrow/_fs.pyx", line 829, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When reading information for key '3d1008dc-55de-490e-b9f9-cd581b6f327f.csv' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/ibis-test/ibis_try.py", line 6, in <module>
    con = ibis.athena.connect(s3_staging_dir=os.environ['AWS_ATHENA_S3_STAGING_DIR'])
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/__init__.py", line 110, in connect
    return backend.connect(*args, **kwargs)
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/__init__.py", line 928, in connect
    new_backend.reconnect()
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/__init__.py", line 943, in reconnect
    self.do_connect(*self._con_args, **self._con_kwargs)
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/athena/__init__.py", line 366, in do_connect
    self._memtable_catalog = self.current_catalog
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/athena/__init__.py", line 46, in current_catalog
    with self._safe_raw_sql(
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/athena/__init__.py", line 291, in _safe_raw_sql
    yield cur.execute(query, *args, **kwargs)
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/cursor.py", line 136, in execute
    self.result_set = AthenaArrowResultSet(
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 79, in __init__
    self._table = self._as_arrow()
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 276, in _as_arrow
    table = self._read_csv()
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 251, in _read_csv
    raise OperationalError(*e.args) from e
pyathena.error.OperationalError: When reading information for key '3d1008dc-55de-490e-b9f9-cd581b6f327f.csv' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
```
I expected the connection to succeed without raising an error, so that I could then call `con.list_tables()`.
Additional details:
I am able to `ls` and `cp` the CSV file referred to in the above stacktrace just fine using the AWS CLI:

```
aws --profile athena_profile s3 ls s3://MY_BUCKET/3d1008dc-55de-490e-b9f9-cd581b6f327f.csv
```
The file contains:

```
$ cat 3d1008dc-55de-490e-b9f9-cd581b6f327f.csv
"_col0"
"awsdatacatalog"
```
I am able to use PyAthena directly just fine, which I believe is what ibis uses behind the scenes:
```python
import os

from dotenv import load_dotenv
from pyathena import connect

load_dotenv()  # load env variables from .env

# All connection config is done using env variables, so nothing is passed to connect()
cursor = connect().cursor()
cursor.execute("SELECT * FROM NAMESPACE.TABLE LIMIT 1")
print(cursor.description)
print(cursor.fetchall())
```
I am behind a corporate proxy. The proxy settings are defined via environment variables (`http_proxy`, `https_proxy`, and their uppercase forms).
I am quite eager to be able to use ibis. Please let me know if I can provide any other details to help troubleshoot.
What version of ibis are you using?
ibis-framework 10.5.0
What backend(s) are you using, if any?
PyAthena 3.13.0
Relevant log output
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Can you try reading a file using vanilla pyarrow?
I tried. It looks like pyarrow doesn't read proxy settings from environment variables.
For example, the following doesn't work; it throws an error like the one in the original post:
```python
import pyarrow as pa
from pyarrow import fs

s3 = fs.SubTreeFileSystem("MY_BUCKET", fs.S3FileSystem(region='MY_REGION'))
fl = s3.open_input_file('3d1008dc-55de-490e-b9f9-cd581b6f327f.csv')
print(fl.readall())
```
Error:

```
Traceback (most recent call last):
  File "/tmp/ibis-test/pyarrow_test.py", line 6, in <module>
    fl = s3.open_input_file('3d1008dc-55de-490e-b9f9-cd581b6f327f.csv')
  File "pyarrow/_fs.pyx", line 787, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When reading information for key '3d1008dc-55de-490e-b9f9-cd581b6f327f.csv' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
```
But this works:

```python
import os

import pyarrow as pa
from pyarrow import fs

s3 = fs.SubTreeFileSystem(
    "MY_BUCKET",
    fs.S3FileSystem(region='MY_REGION', proxy_options=os.environ['http_proxy']),
)
fl = s3.open_input_file('3d1008dc-55de-490e-b9f9-cd581b6f327f.csv')
print(fl.readall())
```
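For anyone hitting the same issue: pyarrow's `S3FileSystem` accepts `proxy_options` either as a proxy URI string (as above) or as a dict with `scheme`, `host`, and `port` keys. A small helper can build that argument from the standard proxy environment variables, so it is easy to skip it entirely when no proxy is configured. This is just a sketch; the lookup order across the four variables is my own assumption:

```python
import os
from urllib.parse import urlparse


def proxy_options_from_env():
    """Build a proxy_options dict for pyarrow's S3FileSystem from env vars.

    Returns None when no proxy variable is set, so the caller can omit
    the argument entirely. The https_proxy-before-http_proxy preference
    here is an assumption, not something pyarrow mandates.
    """
    raw = (
        os.environ.get("https_proxy")
        or os.environ.get("HTTPS_PROXY")
        or os.environ.get("http_proxy")
        or os.environ.get("HTTP_PROXY")
    )
    if not raw:
        return None
    parts = urlparse(raw)
    return {"scheme": parts.scheme, "host": parts.hostname, "port": parts.port}
```

With this, one can construct the filesystem conditionally, e.g. pass `proxy_options=opts` to `fs.S3FileSystem(...)` only when `opts = proxy_options_from_env()` is not None.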