bug: Unable to connect to athena

Open yeban opened this issue 6 months ago • 2 comments
What happened?

I followed the instructions at https://ibis-project.org/backends/athena to connect to the Amazon Athena backend as follows:

import os
import ibis
from dotenv import load_dotenv

load_dotenv() # load env variables from .env

# the staging dir env variable below is of the format s3://MY_BUCKET
con = ibis.athena.connect(s3_staging_dir=os.environ['AWS_ATHENA_S3_STAGING_DIR'])

The call to ibis.athena.connect fails with the following error:

Failed to read MY_BUCKET/3d1008dc-55de-490e-b9f9-cd581b6f327f.csv.
Traceback (most recent call last):
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 240, in _read_csv
    self._fs.open_input_stream(f"{bucket}/{key}"),
  File "pyarrow/_fs.pyx", line 829, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When reading information for key '3d1008dc-55de-490e-b9f9-cd581b6f327f.csv' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
Traceback (most recent call last):
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 240, in _read_csv
    self._fs.open_input_stream(f"{bucket}/{key}"),
  File "pyarrow/_fs.pyx", line 829, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When reading information for key '3d1008dc-55de-490e-b9f9-cd581b6f327f.csv' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/ibis-test/ibis_try.py", line 6, in <module>
    con = ibis.athena.connect(s3_staging_dir=os.environ['AWS_ATHENA_S3_STAGING_DIR'])
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/__init__.py", line 110, in connect
    return backend.connect(*args, **kwargs)
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/__init__.py", line 928, in connect
    new_backend.reconnect()
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/__init__.py", line 943, in reconnect
    self.do_connect(*self._con_args, **self._con_kwargs)
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/athena/__init__.py", line 366, in do_connect
    self._memtable_catalog = self.current_catalog
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/athena/__init__.py", line 46, in current_catalog
    with self._safe_raw_sql(
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/ibis/backends/athena/__init__.py", line 291, in _safe_raw_sql
    yield cur.execute(query, *args, **kwargs)
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/cursor.py", line 136, in execute
    self.result_set = AthenaArrowResultSet(
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 79, in __init__
    self._table = self._as_arrow()
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 276, in _as_arrow
    table = self._read_csv()
  File "/tmp/ibis-test/virtualenv/lib/python3.10/site-packages/pyathena/arrow/result_set.py", line 251, in _read_csv
    raise OperationalError(*e.args) from e
pyathena.error.OperationalError: When reading information for key '3d1008dc-55de-490e-b9f9-cd581b6f327f.csv' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

I expected the connection to succeed without raising an error, so that I could call con.list_tables() next.

Additional details:

I can ls and cp the CSV file named in the stack trace above without problems using the AWS CLI:

aws --profile athena_profile s3 ls s3://MY_BUCKET/3d1008dc-55de-490e-b9f9-cd581b6f327f.csv

The file contains:

$ cat 3d1008dc-55de-490e-b9f9-cd581b6f327f.csv
"_col0"
"awsdatacatalog"

Using pyathena directly, which I believe is what ibis uses behind the scenes, works fine:

import os
from pyathena import connect
from dotenv import load_dotenv

load_dotenv()  # load env variables from .env

# All pyathena config is done using env variables, so nothing is passed to connect()
cursor = connect().cursor()

cursor.execute("SELECT * FROM NAMESPACE.TABLE LIMIT 1")
print(cursor.description)
print(cursor.fetchall())

I am behind a corporate proxy. The proxy settings are defined via environment variables (http_proxy, https_proxy, and their uppercase forms).
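
To confirm which proxy settings the Python process itself sees, the standard library's urllib.request.getproxies() can help. This is only a minimal sketch: the helper name visible_proxies is made up for illustration, and it shows what Python's own urllib would use, not what pyarrow or pyathena do internally.

```python
import urllib.request


def visible_proxies():
    """Return the proxy settings Python picks up, keyed by URL scheme."""
    # getproxies() reads <scheme>_proxy environment variables in either case;
    # on some platforms it falls back to system settings when none are set.
    return urllib.request.getproxies()


if __name__ == "__main__":
    # Note: the printed URLs may contain proxy credentials.
    for scheme, url in sorted(visible_proxies().items()):
        print(f"{scheme}: {url}")
```

If the expected http and https entries show up here, the environment is set correctly and the question becomes whether each library honors it.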

I am quite eager to be able to use ibis. Please let me know if I can provide any other details to help troubleshoot.

What version of ibis are you using?

ibis-framework 10.5.0

What backend(s) are you using, if any?

PyAthena 3.13.0

Relevant log output


Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

yeban avatar May 07 '25 08:05 yeban

Can you try reading a file using vanilla pyarrow?

cpcloud avatar May 07 '25 11:05 cpcloud

I tried. It looks like pyarrow doesn't read proxy settings from environment variables.

For example, the following doesn't work; it raises the same error as in the original post:

import pyarrow as pa
from pyarrow import fs

s3 = fs.SubTreeFileSystem("MY_BUCKET", fs.S3FileSystem(region='MY_REGION'))
fl = s3.open_input_file('3d1008dc-55de-490e-b9f9-cd581b6f327f.csv')
print(fl.readall())

Error:

Traceback (most recent call last):
  File "/tmp/ibis-test/pyarrow_test.py", line 6, in <module>
    fl = s3.open_input_file('3d1008dc-55de-490e-b9f9-cd581b6f327f.csv')
  File "pyarrow/_fs.pyx", line 787, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When reading information for key '3d1008dc-55de-490e-b9f9-cd581b6f327f.csv' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

But this works:

import os
import pyarrow as pa
from pyarrow import fs

s3 = fs.SubTreeFileSystem("MY_BUCKET", fs.S3FileSystem(region='MY_REGION', proxy_options=os.environ['http_proxy']))
fl = s3.open_input_file('3d1008dc-55de-490e-b9f9-cd581b6f327f.csv')
print(fl.readall())
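
Besides a URI string, pyarrow's S3FileSystem documents a dict form for proxy_options with the keys scheme, host, port, and optionally username and password. The following is a rough sketch of turning a proxy URL into that dict. The helper name proxy_options_from_url is hypothetical, and the default-port fallback and the percent-decoding of credentials are my assumptions rather than anything pyarrow prescribes.

```python
from urllib.parse import unquote, urlsplit


def proxy_options_from_url(proxy_url):
    """Split a proxy URL into the dict form documented for proxy_options."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("proxy URL must look like http(s)://host[:port]")
    # Assumption: use the scheme's default port when the URL has none.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    options = {"scheme": parts.scheme, "host": parts.hostname, "port": port}
    # Assumption: credentials in the URL are percent-encoded, so decode them.
    if parts.username:
        options["username"] = unquote(parts.username)
    if parts.password:
        options["password"] = unquote(parts.password)
    return options
```

The result could then be passed as fs.S3FileSystem(region='MY_REGION', proxy_options=proxy_options_from_url(os.environ['http_proxy'])), which should be equivalent to passing the URL string directly as in the working snippet above.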

yeban avatar May 07 '25 15:05 yeban