iceberg-python

REST catalog, S3tables with botocore session

Open Flogue opened this issue 2 months ago • 0 comments

Apache Iceberg version

0.10.0

Please describe the bug 🐞

Since 0.10.0, a botocore session can be passed to a REST catalog, for example:

import io
import os

import pandas as pd
import pyarrow as pa

from boto3 import Session
from pyiceberg.catalog import load_catalog

boto3_session = Session(profile_name='a_profile', region_name='us-east-1')

catalog = load_catalog(
        "catalog",
        type="rest",
        botocore_session=boto3_session._session,
        warehouse="arn:aws:s3tables:us-east-1:XXXXXXXXXXX:bucket/a_bucket",
        uri="https://s3tables.us-east-1.amazonaws.com/iceberg",
        **{
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "s3tables",
            "rest.signing-region": "us-east-1"
        })

table = catalog.load_table("namespace.a_table")

json_string = "[{\"data\":\"000000000000\", ...}]"
df = pd.read_json(io.StringIO(json_string), orient='records')

arrow_table = pa.Table.from_pandas(df=df, schema=table.schema().as_arrow())

table.overwrite(arrow_table)

This works until we call table.overwrite(), which fails with:

OSError: When reading information for key 'metadata/snap-6778585584222594295-0-3ae9518f-fd1c-488f-b3d2-4ca1724317a1.avro' in bucket '2c8e7acb-67a1-4dc9-8ym9eg38966b8bazzfjn487w5o9wruse1b--table-s3': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

As a workaround, we can export the session's credentials as environment variables just before calling overwrite():

boto3_session = Session(profile_name='a_profile', region_name='us-east-1')

catalog = load_catalog(
        "catalog",
        type="rest",
        botocore_session=boto3_session._session,
        warehouse="arn:aws:s3tables:us-east-1:XXXXXXXXXXX:bucket/a_bucket",
        uri="https://s3tables.us-east-1.amazonaws.com/iceberg",
        **{
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "s3tables",
            "rest.signing-region": "us-east-1"
        })

table = catalog.load_table("namespace.a_table")

json_string = "[{\"data\":\"000000000000\", ...}]"
df = pd.read_json(io.StringIO(json_string), orient='records')

arrow_table = pa.Table.from_pandas(df=df, schema=table.schema().as_arrow())

credentials = boto3_session.get_credentials().get_frozen_credentials()
os.environ["AWS_ACCESS_KEY_ID"] = credentials.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = credentials.secret_key
if credentials.token:
    os.environ["AWS_SESSION_TOKEN"] = credentials.token
table.overwrite(arrow_table)

This works, but it defeats the purpose of passing a botocore session to the catalog.
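If the environment-variable workaround is needed in the meantime, it can at least be scoped so the credentials do not leak into the rest of the process. The following is a minimal sketch, not part of PyIceberg: `temporary_env` and `aws_env` are hypothetical helper names, and the credentials object is assumed to expose `access_key`, `secret_key` and `token` like the frozen credentials used above.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator, Mapping, Optional


@contextmanager
def temporary_env(values: Mapping[str, Optional[str]]) -> Iterator[None]:
    """Set environment variables inside a with-block, then restore the old state."""
    # Skip empty values, e.g. a missing session token.
    updates = {key: value for key, value in values.items() if value}
    previous = {key: os.environ.get(key) for key in updates}
    os.environ.update(updates)
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                os.environ.pop(key, None)  # variable did not exist before
            else:
                os.environ[key] = old  # put the earlier value back


def aws_env(credentials) -> Dict[str, Optional[str]]:
    """Map frozen credentials to the standard AWS environment variable names."""
    return {
        "AWS_ACCESS_KEY_ID": credentials.access_key,
        "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
        "AWS_SESSION_TOKEN": credentials.token,
    }


# Hypothetical usage with the objects from the snippet above:
# creds = boto3_session.get_credentials().get_frozen_credentials()
# with temporary_env(aws_env(creds)):
#     table.overwrite(arrow_table)
```

This only narrows the window in which the variables are set; it is still the same workaround and does not make overwrite() use the botocore session.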

Catalog-level calls such as .schema() succeed, so it seems that only the overwrite method is not using the proper SigV4Adapter (pyiceberg/catalog/rest/__init__.py).
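Since the failure surfaces in an S3 HeadObject call, one way to probe this would be to hand the same credentials to the table's file I/O layer explicitly instead of through the environment. The sketch below builds the `s3.*` properties that PyIceberg documents for its FileIO configuration. The helper name `s3_fileio_properties` is hypothetical, and whether these properties take precedence over configuration returned by the S3 Tables REST endpoint is an assumption that has not been verified here.

```python
from typing import Dict, Optional


def s3_fileio_properties(
    access_key: str,
    secret_key: str,
    token: Optional[str],
    region: str,
) -> Dict[str, str]:
    """Build PyIceberg S3 FileIO properties from explicit credentials."""
    props = {
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
        "s3.region": region,
    }
    # Only temporary credentials carry a session token.
    if token:
        props["s3.session-token"] = token
    return props


# Hypothetical usage with the objects from the snippet above:
# creds = boto3_session.get_credentials().get_frozen_credentials()
# extra = s3_fileio_properties(
#     creds.access_key, creds.secret_key, creds.token, "us-east-1"
# )
# These entries would be merged into the dict already passed to load_catalog().
```

If overwrite() succeeds with these properties set, that would suggest the session is reaching the catalog requests but not the file reads and writes.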

Willingness to contribute

  • [ ] I can contribute a fix for this bug independently
  • [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • [ ] I cannot contribute a fix for this bug at this time

Flogue, Oct 23 '25 12:10