Unable to load an Iceberg table from an AWS Glue catalog
Question
PyIceberg version: 0.6.0
Python version: 3.11.1
Comments:
- Iceberg tables are saved in an AWS Glue catalog
- the catalog, the list of namespaces, and the list of tables are all retrievable through the catalog API
Hi,
I am facing issues loading Iceberg tables from AWS Glue. The code I am using is as follows:
from opensea.resources.resources import *
import pyiceberg.catalog
profile_name = "saml2aws_profile_name"
catalog_name = "catalog name"
table_name = "table name"
aws_region = "aws region"
catalog = pyiceberg.catalog.load_catalog(
catalog_name, **{"type": "glue", "profile_name": profile_name}
)
print(catalog.list_namespaces())
table = catalog.load_table((catalog_name, table_name))
The code allows me to:
- list namespaces
- list tables
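For completeness, the table listing was done along these lines (a hypothetical one-liner, since only list_namespaces appears in the snippet above; the namespace value is a placeholder):
print(catalog.list_tables("namespace_name"))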
But load_table throws the following error:
Traceback (most recent call last):
File "/path/to/the/project/testing.py", line 15, in <module>
table = catalog.load_table((catalog_name, table_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
metadata = FromInputFile.table_metadata(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
with input_file.open() as input_stream:
^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
input_file = self._filesystem.open_input_file(self._path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
I have checked that I have the proper access permissions, so that isn't the issue. I have tried a few other things, but they were all unsuccessful:
- using load_glue instead of load_catalog
- providing the access key and secret key directly in the load_catalog call (roughly as in the sketch below)
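For reference, a minimal sketch of that second attempt, assuming the boto-style property names that come up later in this thread (the credential values are placeholders):
import pyiceberg.catalog

catalog = pyiceberg.catalog.load_catalog(
    catalog_name,
    **{
        "type": "glue",
        "profile_name": profile_name,
        # hypothetical placeholder values for the credentials passed directly
        "aws_access_key_id": "<access key>",
        "aws_secret_access_key": "<secret key>",
    },
)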
The table definition is as follows and was created via Trino:
create table catalog_name.table_name (
"timestamp" timestamp,
"type" varchar(20),
distribution int,
service int,
code varchar(20),
base_id bigint,
counter_id bigint,
"category" varchar(50),
volume double)
with (
format = 'PARQUET',
partitioning = ARRAY['day(timestamp)'],
location = 's3://s3_bucket/path/to/table/folder/'
)
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
This seems to be an issue with reading the metadata file. Specifically, this line: https://github.com/apache/iceberg-python/blob/781096eb0c71fa540357e0e6e3b51104ad6469ee/pyiceberg/catalog/glue.py#L320
What is the metadata_location of the table in the Glue catalog?
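One way to check it (a sketch that queries Glue directly with boto3; the profile, database, and table names are placeholders):
import boto3

glue = boto3.Session(profile_name="saml2aws_profile_name").client("glue")
response = glue.get_table(DatabaseName="database_name", Name="table_name")
print(response["Table"]["Parameters"]["metadata_location"])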
Glue points to that same metadata file.
I have tried reading this table using PySpark, and it worked. Nevertheless, PySpark isn't the ideal solution for my case.
If it works in PySpark, the problem is probably not the Glue configuration but something in pyiceberg.
Can you double-check the AWS settings? It looks like your AWS profile can access the Glue catalog and read its content. Does it have permission to read the underlying S3 file?
Secondly,
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
That S3 path looks fishy to me, especially the prefix path/to/s3/table/location/metadata/ and the missing s3://. We can also check whether the PyArrow filesystem is parsing the metadata_location correctly.
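For example, a quick standalone check (an assumption, not part of pyiceberg itself; the location value is a placeholder) of how PyArrow resolves an s3:// URI:
from pyarrow import fs

metadata_location = "s3://s3_bucket_name/path/to/s3/table/location/metadata/xxx.metadata.json"
# FileSystem.from_uri returns the resolved filesystem plus the path with the scheme stripped
filesystem, path = fs.FileSystem.from_uri(metadata_location)
print(type(filesystem).__name__, path)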
Can you double-check the AWS settings? It looks like your AWS profile can access the Glue catalog and read its content. Does it have permission to read the underlying S3 file?
Yes, the profile I am using can access the underlying files in S3.
That S3 path looks fishy to me, especially the prefix path/to/s3/table/location/metadata/ and the missing s3://. We can also check whether the PyArrow filesystem is parsing the metadata_location correctly.
The path I am using does indeed start with s3://.
The load_table operation is doing a couple of different things.
Let's verify each step.
Getting the "glue table" object, using the _get_glue_table function:
import pyiceberg.catalog
from pyiceberg.exceptions import NoSuchTableError

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": profile_name}
)
# identifier is the same tuple that was passed to load_table, e.g. (catalog_name, table_name)
identifier = (catalog_name, table_name)
identifier_tuple = catalog.identifier_to_tuple_without_catalog(identifier)
database_name, table_name = catalog.identifier_to_database_and_table(identifier_tuple, NoSuchTableError)
glue_table = catalog._get_glue_table(database_name=database_name, table_name=table_name)
print(glue_table)
Look at the glue table's metadata_location:
properties = glue_table["Parameters"]
METADATA_LOCATION = "metadata_location"
metadata_location = properties[METADATA_LOCATION]
print(metadata_location)
Load the metadata file and check the IO implementation:
from pyiceberg.io import load_file_io
from pyiceberg.serializers import FromInputFile

io = load_file_io(properties=catalog.properties, location=metadata_location)
print(io)
file = io.new_input(metadata_location)
print(file)
metadata = FromInputFile.table_metadata(file)
print(metadata)
A similar-sounding issue: https://github.com/apache/iceberg/issues/6820
Your Glue calls look fine, but your S3 calls are the problem. I was able to reproduce the issue by having the incorrect region for my AWS profile in ~/.aws/config and passing in a different region when initializing the catalog.
aws_config
[test]
region = us-east-1
catalog init
catalog = pyiceberg.catalog.load_catalog(
catalog_name, **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
)
Which leads to this exception:
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'test/metadata/00000-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
It looks like when we infer the correct FileIO, the PyArrow filesystem doesn't use the AWS profile config, which might mean the calls are delegated to the default profile instead.
https://github.com/apache/iceberg-python/blob/7fcdb8d25dfa2498ba98a2b8e8d2b327d85fa7c9/pyiceberg/io/pyarrow.py#L339-L357
We might need to feed the credentials into the session properties before inferring the FileIO in the GlueCatalog, so that we actually use the correct profile when reading from S3. For now, you should be able to work around this by ensuring the profile's region is in sync with the config passed into the catalog, or by passing the s3.region property into the catalog, as in the sketch below.
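For instance (a sketch of that workaround; the region value is illustrative):
catalog = pyiceberg.catalog.load_catalog(
    catalog_name,
    **{
        "type": "glue",
        "profile_name": profile_name,
        # pin the region used by the S3 FileIO so it matches the bucket's region
        "s3.region": "us-west-2",
    },
)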
Edit: I just saw the message above; the fix is also there.
@geruh thanks for the explanation! Would you say this is a bug in how pyiceberg configures S3? I'm not familiar with the AWS profile config. It seems like if a profile config is passed in, we don't want to override other S3 options, such as region in this case.
No Problem!!
This could potentially be a bug if we assume that the catalog and the FileIO (S3) share the same AWS profile config. On one hand, having a single profile configuration is convenient for the user's boto client, as it allows initializing all AWS clients with the correct credentials. On the other hand, we could argue that this configuration should only apply at the catalog level, and that filesystems might require separate configuration. I'm inclined towards the first option. However, we are using PyArrow's S3FileSystem implementation, which has no concept of an AWS profile, so we would need to resolve these values through boto's session.get_credentials() and pass them to the filesystem, roughly as sketched below.
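A minimal sketch of that idea (not the actual pyiceberg implementation; the profile and region values are placeholders), resolving credentials from a boto3 session and handing them to PyArrow's S3FileSystem:
import boto3
from pyarrow.fs import S3FileSystem

session = boto3.Session(profile_name="saml2aws_profile_name", region_name="us-west-2")
# freeze the resolved credentials to get plain access key / secret key / token values
credentials = session.get_credentials().get_frozen_credentials()

filesystem = S3FileSystem(
    access_key=credentials.access_key,
    secret_key=credentials.secret_key,
    session_token=credentials.token,  # None for long-lived credentials, which S3FileSystem accepts
    region=session.region_name,
)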
I'll raise an issue for this
Thank you! Should we close this in favor of #570?
I have tried both solutions, i.e.:
- setting the environment variable to the proper AWS region
- providing it within the function call
But I am always getting the same error:
Traceback (most recent call last):
File "/path/to/the/project/testing.py", line 15, in <module>
table = catalog.load_table((catalog_name, table_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
metadata = FromInputFile.table_metadata(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
with input_file.open() as input_stream:
^^^^^^^^^^^^^^^^^
File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
input_file = self._filesystem.open_input_file(self._path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.
Interesting. Can you run aws sts get-caller-identity in the terminal to ensure the right identity is being used?
You can also explicitly set the S3FileIO by passing the S3 configuration properties into the catalog:
catalog = pyiceberg.catalog.load_catalog(
    catalog_name,
    **{
        "type": "glue",
        "profile_name": profile_name,
        "s3.access-key-id": "access-key",
        "s3.secret-access-key": "secret-access-key",
        "s3.region": "us-east-1",
    },
)
This worked for me when I also added the session token information for S3:
catalog = load_catalog(
    "default",
    **{
        "type": "glue",
        "aws_access_key_id": "ASAXXXXXXXXXX",
        "aws_secret_access_key": "0VLxnXXXXXXXXXXX",
        "aws_session_token": "IQJb3JpXXXXXXXXXXXXXXXXXXXXXXXX",
        "s3.access-key-id": "ASAXXXXXXXXXX",
        "s3.secret-access-key": "0VLxnXXXXXXXXXXX",
        "s3.session-token": "IQJb3JpXXXXXXXXXXXXXXXXXXXXXXXX",
        "s3.region": "eu-west-1",
        "region_name": "eu-west-1",
    },
)
We have the same problem here. My manager and I tried to get it to work in parallel, and we both ran into the same error. We assumed it was a permission issue, but even with admin credentials it didn't work. We used access tokens, tried to set the region manually, and provided the AWS profile name and, alternatively, the access keys. No success.
My guess is that it has something to do with the s3fs package used to read the metadata file.
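One way to test that guess (a sketch, assuming pyiceberg's documented py-io-impl property; note the traceback above points at the default PyArrow FileIO rather than s3fs) is to force the fsspec-based implementation and see whether the error changes:
catalog = pyiceberg.catalog.load_catalog(
    catalog_name,
    **{
        "type": "glue",
        "profile_name": profile_name,
        # force the fsspec/s3fs-backed FileIO instead of the default PyArrowFileIO
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    },
)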
We had the same problem within our Airflow deployment. The easy fix for us would have been to set the default AWS credentials through environment variables:
AWS_ACCESS_KEY_ID=<aws access key>
AWS_DEFAULT_REGION=<aws region>
AWS_SECRET_ACCESS_KEY=<aws secret key>
This, however, wasn't feasible because of deployment issues. Long story short, we ended up with this solution:
from pyiceberg.catalog import load_catalog
from pyiceberg.catalog.glue import GlueCatalog

glue_catalog_conf = {
"s3.region": <aws region>,
"s3.access-key-id": <aws access key>,
"s3.secret-access-key": <aws secret key>,
"region_name": <aws region>,
"aws_access_key_id": <aws access_key>,
"aws_secret_access_key": <aws secret key>,
}
catalog: GlueCatalog = load_catalog(
"some_name",
type="glue",
**glue_catalog_conf
)
If you come from a Google search, please take everything that follows with a grain of salt, because we have no previous experience with either pyiceberg or Airflow. Anyway.
We came to this conclusion (that we needed to pass both formats) because it seems that the boto client initialization expects one format (the second set in the above snippet):
class GlueCatalog(Catalog):
def __init__(self, name: str, **properties: Any):
super().__init__(name, **properties)
session = boto3.Session(
profile_name=properties.get("profile_name"),
region_name=properties.get("region_name"),
botocore_session=properties.get("botocore_session"),
aws_access_key_id=properties.get("aws_access_key_id"),
aws_secret_access_key=properties.get("aws_secret_access_key"),
aws_session_token=properties.get("aws_session_token"),
)
self.glue: GlueClient = session.client("glue")
And the same set of properties is passed to pyiceberg's load_file_io function, which, to the extent of our very limited understanding, expects the other format (the s3.* properties):
io = load_file_io(properties=self.properties, location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)
return Table(
identifier=(self.name, database_name, table_name),
metadata=metadata,
metadata_location=metadata_location,
io=self._load_file_io(metadata.properties, metadata_location),
catalog=self,
)
We might be completely off base here, of course, and what ultimately convinced us to adopt the above solution is just that it works, while passing either set of credentials without the other wouldn't work for us.
We're using:
aiobotocore==2.13.1
boto3==1.34.51
botocore==1.34.131
[...]
pyiceberg==0.6.1
We're still unclear on whether it's indeed a bug or we're just using the APIs improperly, any help would be appreciated.
Have a nice day!
@impproductions Thanks for the detailed explanation. Great catch!
Looking through the code, there's indeed an expectation for both AWS credential formats:
- s3.access-key-id vs aws_access_key_id
- s3.secret-access-key vs aws_secret_access_key
This issue exists for both the Glue and DynamoDB catalogs.
https://github.com/search?q=repo%3Aapache%2Ficeberg-python+aws_secret_access_key+path%3A.py+-path%3Atests&type=code
Opened #892 to track the issue with AWS credential formats
Fixed in #922