
pyiceberg with hive and S3 fails even when providing creds

Open lozbrown opened this issue 10 months ago • 1 comment

I'm trying to use pyiceberg inside a pod that has S3 access via an assumed role.

I've configured the PYICEBERG_CATALOG__DEFAULT__S3__ROLE_ARN and AWS_ROLE_ARN environment variables, but creating a table still fails with an ACCESS_DENIED error on HeadObject.
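Roughly the setup that triggers it, as a sketch; the role ARN, metastore URI and table identifier below are placeholders, not my real values:

import os
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

# Placeholder role ARN and Hive metastore URI.
os.environ["AWS_ROLE_ARN"] = "arn:aws:iam::123456789012:role/pod-role"
os.environ["PYICEBERG_CATALOG__DEFAULT__S3__ROLE_ARN"] = "arn:aws:iam::123456789012:role/pod-role"
os.environ["PYICEBERG_CATALOG__DEFAULT__URI"] = "thrift://hive-metastore:9083"

catalog = load_catalog("default")
schema = Schema(NestedField(field_id=1, name="query_id", field_type=StringType(), required=True))

# Fails while writing the first metadata file to S3:
catalog.create_table_if_not_exists("meta.trino_queries_iceberg", schema=schema)

The traceback: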

  File "/usr/local/lib64/python3.12/site-packages/pyiceberg/catalog/__init__.py", line 420, in create_table_if_not_exists
    return self.create_table(identifier, schema, location, partition_spec, sort_order, properties)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.12/site-packages/pyiceberg/catalog/hive.py", line 404, in create_table
    self._write_metadata(staged_table.metadata, staged_table.io, staged_table.metadata_location)
  File "/usr/local/lib64/python3.12/site-packages/pyiceberg/catalog/__init__.py", line 939, in _write_metadata
    ToOutputFile.table_metadata(metadata, io.new_output(metadata_path))
  File "/usr/local/lib64/python3.12/site-packages/pyiceberg/serializers.py", line 130, in table_metadata
    with output_file.create(overwrite=overwrite) as output_stream:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 338, in create
    if not overwrite and self.exists() is True:
                         ^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 282, in exists
    self._file_info()  # raises FileNotFoundError if it does not exist
    ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 264, in _file_info
    file_info = self._filesystem.get_file_info(self._path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 590, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When getting information for key 'schemas/meta.db/trino_queries_iceberg/metadata/00000-41568416-bc76-4236-afab-a7bec772eb32.metadata.json' in bucket 'REDACTED-BUCKET': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

I believe that's this PyArrow issue: https://github.com/apache/arrow/issues/38421
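If it is the same bug, it should be reproducible without PyIceberg by using PyArrow's S3FileSystem directly; a minimal sketch, with a placeholder role ARN, region, bucket and key:

import pyarrow.fs as pafs

# Placeholder role ARN, region and object path.
fs = pafs.S3FileSystem(
    role_arn="arn:aws:iam::123456789012:role/pod-role",
    region="us-east-1",
)
# Should raise the same "AWS Error ACCESS_DENIED during HeadObject" if the
# assumed-role session misbehaves the way apache/arrow#38421 describes.
print(fs.get_file_info("my-bucket/path/to/object.json"))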

To get past this, I set these keys at the top of my Python code, before loading the catalog:

import os
import boto3

# Resolve the pod role's credentials once and expose them both to generic
# AWS tooling and to PyIceberg's S3 FileIO properties.
session = boto3.session.Session()
creds = session.get_credentials()

os.environ['AWS_ACCESS_KEY_ID'] = creds.access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = creds.secret_key
os.environ['AWS_SESSION_TOKEN'] = creds.token
os.environ['PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID'] = creds.access_key
os.environ['PYICEBERG_CATALOG__DEFAULT__S3__SECRET_ACCESS_KEY'] = creds.secret_key
os.environ['PYICEBERG_CATALOG__DEFAULT__S3__SESSION_TOKEN'] = creds.token
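Note that these are the temporary credentials of the assumed role, so a long-running process would need to refresh the environment variables before the session expires.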

lozbrown • Mar 07 '25 15:03

Thanks for filing this issue!

It looks like we do pass the s3.role-arn property to the underlying PyArrow S3FileSystem, and the issue is in the S3FileSystem itself, as described in https://github.com/apache/arrow/issues/38421
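For completeness, the same property can be set in code when loading the catalog instead of through the environment variable; a sketch with a placeholder metastore URI, role ARN and region:

from pyiceberg.catalog import load_catalog

# Placeholder URI, ARN and region. "s3.role-arn" is the catalog property
# that gets forwarded to PyArrow's S3FileSystem as role_arn.
catalog = load_catalog(
    "default",
    **{
        "uri": "thrift://hive-metastore:9083",
        "s3.role-arn": "arn:aws:iam::123456789012:role/pod-role",
        "s3.region": "us-east-1",
    },
)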

kevinjqliu • Mar 11 '25 17:03

Was there an update on this? I think I am seeing the same issue.

My .pyiceberg.yaml looks like:

catalog:
  local:
    uri: sqlite:///iceberg_catalog/catalog.db
    warehouse: s3://my_bucket/warehouse

I was just following: https://estuary.dev/blog/getting-started-pyiceberg/

But if I change the warehouse to file://, it all works.

So I suspect the issue is in the S3 filesystem module. I tried setting the credentials as in https://github.com/apache/iceberg-python/issues/1775#issue-2903374205, but I still see the same error.

My credentials are set in ~/.aws/credentials, and I set the AWS_PROFILE and AWS_REGION variables.
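For reference, one way to hand the profile credentials straight to this catalog, as a sketch (property names are PyIceberg's standard s3.* FileIO options; the region value is a placeholder), so that PyArrow's S3FileSystem does not have to discover them on its own:

import boto3
from pyiceberg.catalog import load_catalog

# Resolve whatever AWS_PROFILE points at in ~/.aws/credentials.
creds = boto3.session.Session().get_credentials().get_frozen_credentials()

props = {
    "uri": "sqlite:///iceberg_catalog/catalog.db",
    "warehouse": "s3://my_bucket/warehouse",
    "s3.access-key-id": creds.access_key,
    "s3.secret-access-key": creds.secret_key,
    "s3.region": "us-east-1",  # placeholder region
}
if creds.token:  # only present for temporary / assumed-role credentials
    props["s3.session-token"] = creds.token

catalog = load_catalog("local", **props)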

rnadhani-ns • Aug 28 '25 13:08