Support for S3 catalog to work with S3 Tables
Feature Request / Improvement
Amazon S3 Tables has been launched, see this, and it looks like S3 Tables comes with a managed Iceberg catalog.
Based on https://github.com/awslabs/s3-tables-catalog it looks like AWS built an S3 catalog wrapper in Java that can be used by query engines like Spark/Trino. It would be valuable to be able to write to S3 Tables via pyiceberg.
More context
Based on my understanding, once an S3 table is created, its Iceberg metadata is not initialized. For a freshly created table, it's possible to retrieve the warehouseLocation -> see get_table. The warehouseLocation behaves like a unique S3 bucket that you can put S3 objects into. After putting the S3 objects of an Iceberg commit operation (data + metadata), it's possible to use update_table_metadata_location to point the S3 table to the new metadata location.
Note: I'm not 100% sure about the above - I still need to validate it via some tests.
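A minimal sketch of that flow, assuming the boto3 s3tables client and hypothetical ARN/namespace/table values:

import boto3

s3tables = boto3.client("s3tables")
table_bucket_arn = "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket"  # hypothetical

# 1. Fetch the freshly created table to obtain its warehouseLocation
#    (where data/metadata objects go) and the versionToken required
#    for the metadata update.
table = s3tables.get_table(
    tableBucketARN=table_bucket_arn,
    namespace="my_namespace",
    name="my_table",
)
warehouse_location = table["warehouseLocation"]
version_token = table["versionToken"]

# 2. Write the Iceberg data + metadata objects of the commit under
#    warehouse_location (not shown here).

# 3. Point the S3 table at the newly written metadata file.
s3tables.update_table_metadata_location(
    tableBucketARN=table_bucket_arn,
    namespace="my_namespace",
    name="my_table",
    versionToken=version_token,
    metadataLocation=f"{warehouse_location}/metadata/00001-example.metadata.json",
)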
Thanks for raising this @nicor88! Would you be interested in contributing this feature?
The catalog implementation can be found here https://github.com/awslabs/s3-tables-catalog/blob/adfeece9873f06364a4a093bfedacb5efe4a952b/src/software/amazon/s3tables/iceberg/S3TablesCatalog.java
I also would be interested in this feature.
@kevinjqliu Unfortunately I don't have the capacity at the moment to contribute to this feature. I would nevertheless be available to look at the PR and test the implementation.
I'm also interested, I will have a look at the reference @nicor88 provided and create a PR if I can get something to work:)
Super keen to see this happen too!
It looks like the warehouse location of those S3 tables doesn't support List operations. I tried to point my local warehouse (using SQLite) at the warehouse location of an S3 table, just to validate whether everything could work, and I got this error:
AWS Error UNKNOWN (HTTP status 405) during ListObjectsV2 operation: Unable to parse ExceptionName: MethodNotAllowed Message: The specified method is not allowed against this resource.
The issue seems to come from pyarrow, which performs this check:
if not overwrite and self.exists() is True:
    raise FileExistsError(f"Cannot create file, already exists: {self.location}")
output_file = self._filesystem.open_output_stream(self._path, buffer_size=self._buffer_size)
The self.exists() call triggers a list operation under the hood, which is not supported.
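A minimal repro sketch of that behavior (bucket/key names are hypothetical): for a key that doesn't exist yet, pyarrow's S3 filesystem falls back to a ListObjectsV2 call to check whether the path is a directory prefix, which S3 Tables buckets reject with HTTP 405.

from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")
# HeadObject returns 404 for the not-yet-written metadata file, so pyarrow
# falls back to ListObjectsV2 to test for a directory prefix -> HTTP 405.
info = s3.get_file_info("my-s3tables-warehouse-bucket/metadata/00000-example.metadata.json")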
I created an initial PR #1429 where I am currently working on supporting table creation. I ran into the same issue that @nicor88 described and could work around it by setting overwrite=True for now, as sketched below.
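A sketch of that workaround using the PyIceberg FileIO API (the location below is hypothetical):

from pyiceberg.io.pyarrow import PyArrowFileIO

io = PyArrowFileIO()
output_file = io.new_output("s3://my-s3tables-warehouse-bucket/metadata/00000-example.metadata.json")
# overwrite=True skips the self.exists() check and therefore the
# unsupported ListObjectsV2 call.
with output_file.create(overwrite=True) as stream:
    stream.write(b"...")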
However, now I get a different error during the write operation for the table metadata:
AWS Error UNKNOWN (HTTP status 400) during CompleteMultipartUpload operation: Unable to parse ExceptionName: S3TablesUnsupportedHeader Message: S3 Tables does not support the following header: x-amz-api-version value: 2006-03-01
I'm currently going through the pyarrow S3FileSystem implementation to see where this header is being introduced.
EDIT:
I tried using a different FileIO and the issue disappears when using pyiceberg.io.fsspec.FsspecFileIO explicitly via:
properties = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}
Seems like this is indeed specific to pyarrow.
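For example, a local SqlCatalog (connection values are hypothetical, matching the SQLite setup mentioned earlier in the thread) configured to use FsspecFileIO:

from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    **{
        "uri": "sqlite:///catalog.db",
        "warehouse": "s3://my-s3tables-warehouse-bucket",  # hypothetical
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    },
)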
@felixscherz thanks for catching this (and thanks to everyone who's interested in building S3 Tables support for PyIceberg!). We're working on an S3-side fix for the x-amz-api-version exception you're seeing; hoping to have that out soon.
@jamesbornholt Great to hear that you folks are keeping an eye on this! Sorry if this is not the right channel for the question, but considering S3 Tables is a mix of catalog + storage layer, is there any plan to provide Iceberg REST compatibility as part of S3 Tables in addition to the current API?
IMO that would help accelerate adoption a lot; otherwise every Iceberg implementation will need to integrate with S3 Tables separately, and I have a feeling the maintenance will be non-trivial.
+1
@felixscherz thanks for catching this (and thanks to everyone who's interested in building S3 Tables support for PyIceberg!). We're working on an S3-side fix for the x-amz-api-version exception you're seeing; hoping to have that out soon.
I just re-ran the tests using PyArrowFileIO and it seems to be fixed now, thank you!
Can we have full example code?
@soumilshah1995 once #1429 is merged, an example would be:
from pyiceberg.catalog.s3tables import S3TablesCatalog
import pyarrow as pa

table_bucket_arn: str = "..."
aws_region: str = "..."

properties = {"s3tables.warehouse": table_bucket_arn, "s3tables.region": aws_region}
catalog = S3TablesCatalog(name="s3tables_catalog", **properties)

database_name = "prod"
catalog.create_namespace(namespace=database_name)

pyarrow_table = pa.Table.from_arrays(
    [
        pa.array([None, "A", "B", "C"]),
        pa.array([1, 2, 3, 4]),
        pa.array([True, None, False, True]),
        pa.array([None, "A", "B", "C"]),
    ],
    schema=pa.schema(
        [
            pa.field("foo", pa.large_string(), nullable=True),
            pa.field("bar", pa.int32(), nullable=False),
            pa.field("baz", pa.bool_(), nullable=True),
            pa.field("large", pa.large_string(), nullable=True),
        ]
    ),
)

identifier = (database_name, "orders")
table = catalog.create_table(identifier=identifier, schema=pyarrow_table.schema)
table.append(pyarrow_table)
I'm currently working on implementing AWS S3 Tables support for the moto library (https://github.com/getmoto/moto/pull/8470), so once that is merged we can improve the tests and merge #1429.
I think that snippet would be great as part of the docs for S3 Tables :)
I will add that to the docs:) currently focusing on the moto side of things:)
Is there a timeline for getting this feature merged into a release? I know you were working on getting the testing framework on the moto side completed, but that looks like it was merged a while back... we're eagerly awaiting S3 Tables support! Thank you!
Hi, the PR for this feature is done and in review at the moment: https://github.com/apache/iceberg-python/pull/1429 :)
https://aws.amazon.com/about-aws/whats-new/2025/03/amazon-s3-tables-apache-iceberg-rest-catalog-apis/
AWS S3 Tables is now compatible with the Iceberg REST catalog API (I did not try it yet):
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-open-source.html
I guess that makes #1429 obsolete:D Supporting the REST Catalog seems like a logical choice for AWS:)
Thanks for all the great work you did here @felixscherz! It's a good reference for anyone looking to use the S3 Tables-specific APIs.
Closing this since pyiceberg can access S3 Tables via the REST catalog client.
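For reference, a minimal sketch of that setup, assuming the SigV4 properties from the AWS docs linked above (region, account ID, and bucket name are hypothetical):

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "s3tables",
    **{
        "type": "rest",
        "uri": "https://s3tables.us-east-1.amazonaws.com/iceberg",
        "warehouse": "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": "us-east-1",
    },
)
print(catalog.list_namespaces())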
Hi @kevinjqliu, can we consider reopening this issue considering this: https://github.com/apache/iceberg-python/pull/1429#issuecomment-3423334847
In summary, supporting the native S3 Tables interface would insulate us from bugs like these in Glue. Schema evolution has been effectively broken for over a month and we're still waiting for a fix to be deployed to the Glue REST API. There is no viable workaround today, other than moving away from pyiceberg. We must either wait for AWS to deploy a fix or for pyiceberg to support the native interface
It was my understanding that the glue/rest interface was a stopgap and there was still a plan to support s3tables natively. Is that no longer the case?
Appreciate your thoughts on this 🙏
Hi @srstrickland thanks for raising this.
Seems like this is an issue with the server-side implementation of the Iceberg REST catalog interface, and it looks like the Glue team is aware, based on https://github.com/apache/iceberg-python/issues/2511#issuecomment-3402084041
We must either wait for AWS to deploy a fix or for pyiceberg to support the native interface
Do you know if a native s3tables implementation would resolve this issue? #1429 has the commit_table function implemented and we can test to see if it can unblock this particular issue.
It was my understanding that the glue/rest interface was a stopgap and there was still a plan to support s3tables natively. Is that no longer the case?
In my opinion, the REST interface is preferred over catalog-specific implementations. This takes the maintenance burden away from pyiceberg. Client and server should speak the same REST protocol, so the pyiceberg client won't need to worry about specific implementations; it can reuse the same logic for the Glue REST catalog and every other REST catalog, as sketched below.
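To illustrate, a sketch of the same REST client pointed at Glue's Iceberg REST endpoint instead; only the URI, warehouse, and signing name change compared to the S3 Tables example above (account ID and region are hypothetical):

from pyiceberg.catalog import load_catalog

glue_rest = load_catalog(
    "glue_rest",
    **{
        "type": "rest",
        "uri": "https://glue.us-east-1.amazonaws.com/iceberg",
        "warehouse": "111122223333",  # hypothetical AWS account ID
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": "us-east-1",
    },
)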