pyiceberg with Hive and S3 fails even when providing credentials
I'm trying to use pyiceberg from within a pod that has S3 access via an IAM role.
I've configured the PYICEBERG_CATALOG__DEFAULT__S3__ROLE_ARN and AWS_ROLE_ARN environment variables, but table creation fails with a HeadObject error:
File "/usr/local/lib64/python3.12/site-packages/pyiceberg/catalog/__init__.py", line 420, in create_table_if_not_exists
return self.create_table(identifier, schema, location, partition_spec, sort_order, properties)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.12/site-packages/pyiceberg/catalog/hive.py", line 404, in create_table
self._write_metadata(staged_table.metadata, staged_table.io, staged_table.metadata_location)
File "/usr/local/lib64/python3.12/site-packages/pyiceberg/catalog/__init__.py", line 939, in _write_metadata
ToOutputFile.table_metadata(metadata, io.new_output(metadata_path))
File "/usr/local/lib64/python3.12/site-packages/pyiceberg/serializers.py", line 130, in table_metadata
with output_file.create(overwrite=overwrite) as output_stream:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 338, in create
if not overwrite and self.exists() is True:
^^^^^^^^^^^^^
File "/usr/local/lib64/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 282, in exists
self._file_info() # raises FileNotFoundError if it does not exist
^^^^^^^^^^^^^^^^^
File "/usr/local/lib64/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 264, in _file_info
file_info = self._filesystem.get_file_info(self._path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_fs.pyx", line 590, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When getting information for key 'schemas/meta.db/trino_queries_iceberg/metadata/00000-41568416-bc76-4236-afab-a7bec772eb32.metadata.json' in bucket 'REDACTED-BUCKET': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
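For context, a minimal sketch of what I'm running (the table identifier here is illustrative, reconstructed from the path in the traceback):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Reads the PYICEBERG_CATALOG__DEFAULT__* settings from the environment.
catalog = load_catalog("default")

schema = pa.schema([("id", pa.int64())])
# Fails while writing the first metadata file to S3: PyArrow's
# S3FileSystem issues a HeadObject that is denied under the assumed role.
catalog.create_table_if_not_exists("meta.trino_queries_iceberg", schema=schema)
```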
I believe that's this PyArrow issue: https://github.com/apache/arrow/issues/38421
To get past this, I set these keys in my Python code before loading the catalog:
import os
import boto3

# Resolve the role's temporary credentials once and export them as static
# credentials so PyArrow's S3FileSystem picks them up.
session = boto3.session.Session()
creds = session.get_credentials().get_frozen_credentials()

os.environ['AWS_ACCESS_KEY_ID'] = creds.access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = creds.secret_key
os.environ['AWS_SESSION_TOKEN'] = creds.token
os.environ['PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID'] = creds.access_key
os.environ['PYICEBERG_CATALOG__DEFAULT__S3__SECRET_ACCESS_KEY'] = creds.secret_key
os.environ['PYICEBERG_CATALOG__DEFAULT__S3__SESSION_TOKEN'] = creds.token
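Equivalently, the same frozen credentials can be passed as catalog properties instead of environment variables. A minimal sketch, assuming a Hive metastore at thrift://hive-metastore:9083 (a placeholder URI):

```python
import boto3
from pyiceberg.catalog import load_catalog

creds = boto3.session.Session().get_credentials().get_frozen_credentials()

catalog = load_catalog(
    "default",
    **{
        "uri": "thrift://hive-metastore:9083",  # placeholder metastore URI
        "s3.access-key-id": creds.access_key,
        "s3.secret-access-key": creds.secret_key,
        "s3.session-token": creds.token,
    },
)
```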
Thanks for filing this issue!
It looks like we do pass s3.role-arn to the underlying PyArrow S3FileSystem, so the issue is in the S3FileSystem itself, as described in https://github.com/apache/arrow/issues/38421.
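You can check the PyArrow side in isolation. A minimal sketch (the role ARN, region, bucket, and key are placeholders) that exercises the same HeadObject path:

```python
from pyarrow.fs import S3FileSystem

# role_arn corresponds to the s3.role-arn property PyIceberg forwards.
fs = S3FileSystem(
    role_arn="arn:aws:iam::123456789012:role/my-role",  # placeholder
    region="us-east-1",  # placeholder
)
# get_file_info is the call that raised ACCESS_DENIED during HeadObject above.
print(fs.get_file_info("my-bucket/path/to/key"))
```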
Was there an update on this? I think I am seeing the same issue.
My .pyiceberg.yaml looks like:
catalog:
  local:
    uri: sqlite:///iceberg_catalog/catalog.db
    warehouse: s3://my_bucket/warehouse
I was just following this tutorial: https://estuary.dev/blog/getting-started-pyiceberg/
But if I change the warehouse to a file:// path, it all works.
So I suspect the issue is with the S3 filesystem module. I tried setting the credentials as in https://github.com/apache/iceberg-python/issues/1775#issue-2903374205, but I still see the same error.
My credentials are set in ~/.aws/credentials, and I do set the AWS_PROFILE and AWS_REGION variables.
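For reference, what I tried looks roughly like this, adapted from that comment to my local catalog (the profile name and region here are placeholders that match my AWS_PROFILE and AWS_REGION):

```python
import boto3
from pyiceberg.catalog import load_catalog

# Resolve static credentials from the named profile in ~/.aws/credentials.
creds = (
    boto3.session.Session(profile_name="default")  # placeholder profile name
    .get_credentials()
    .get_frozen_credentials()
)

catalog = load_catalog(
    "local",
    **{
        "uri": "sqlite:///iceberg_catalog/catalog.db",
        "warehouse": "s3://my_bucket/warehouse",
        "s3.region": "us-east-1",  # placeholder region
        "s3.access-key-id": creds.access_key,
        "s3.secret-access-key": creds.secret_key,
        # add "s3.session-token": creds.token if the profile uses temporary credentials
    },
)
```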