
Support S3 Access Points with Access Point to Bucket mapping

JGynther opened this issue 1 year ago • 7 comments

Feature Request / Improvement

S3 Access Points are a way to scale data access by providing additional endpoints to run S3 object operations against.

Currently, trying to create a StaticTable from table metadata using an S3 Access Point alias works as expected: it is able to query S3 for the metadata just fine and create the table object.

However, running any further queries (e.g. a scan) fails. PyIceberg correctly resolves the table location from the metadata as the bucket underlying the access point, and queries against that location then fail with a 403, assuming access to the bucket has only been granted through the access point. This prevents using PyIceberg via S3 Access Points.

The ideal solution would be to implement something like the workaround in the Java implementation, allowing bucket names to be mapped to access points.
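
For reference, the Java S3FileIO exposes this mapping through catalog properties of the form s3.access-points.<bucket> (prefix quoted from memory, worth double-checking). A comparable configuration in PyIceberg could look like the sketch below; the property name, bucket, and ARN are purely hypothetical, nothing like this exists in PyIceberg today:

from pyiceberg.table import StaticTable

# Hypothetical property mapping a bucket name to the access point to use instead;
# the key "s3.access-points.example-bucket" is illustrative only.
table = StaticTable.from_metadata(
    "s3://example-bucket/path/to/metadata.json",
    {
        "s3.access-points.example-bucket": "arn:aws:s3:us-east-2:111122223333:accesspoint/blue-access-point",
    },
)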

JGynther avatar Feb 20 '24 21:02 JGynther

Could you share the full exception you get when you run the scan query?

hussein-awala avatar Feb 20 '24 22:02 hussein-awala

Here is my minimal code example (that fails):

from pyiceberg.table import StaticTable

# Latest metadata file
object = "iceberg/metadata/00068-b5e701c2-1520-4ff5-9484-aef7ba257d6f.metadata.json"

table = StaticTable.from_metadata(
    f"s3://<name>-<number>-s3alias/{object}"
)

connection = table.scan(limit=100).to_duckdb("test")

And the exception from running this:

Traceback (most recent call last):
  File "/Users/gynther/minimal-data-mesh/s3accesspoint.py", line 9, in <module>
    connection = table.scan(limit=100).to_duckdb("test")
  File "/Users/gynther/Library/Python/3.9/lib/python/site-packages/pyiceberg/table/__init__.py", line 904, in to_duckdb
    con.register(table_name, self.to_arrow())
  File "/Users/gynther/Library/Python/3.9/lib/python/site-packages/pyiceberg/table/__init__.py", line 889, in to_arrow
    self.plan_files(),
  File "/Users/gynther/Library/Python/3.9/lib/python/site-packages/pyiceberg/table/__init__.py", line 831, in plan_files
    for manifest_file in snapshot.manifests(io)
  File "/Users/gynther/Library/Python/3.9/lib/python/site-packages/pyiceberg/table/snapshots.py", line 107, in manifests
    return list(read_manifest_list(file))
  File "/Users/gynther/Library/Python/3.9/lib/python/site-packages/pyiceberg/manifest.py", line 371, in read_manifest_list
    with AvroFile[ManifestFile](
  File "/Users/gynther/Library/Python/3.9/lib/python/site-packages/pyiceberg/avro/file.py", line 168, in __enter__
    with self.input_file.open() as f:
  File "/Users/gynther/Library/Python/3.9/lib/python/site-packages/pyiceberg/io/pyarrow.py", line 230, in open
    input_file = self._filesystem.open_input_file(self._path)
  File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When reading information for key 'iceberg/metadata/snap-8568433207443008232-1-e58de399-3ee1-45a6-b4df-a7c0e33bccbd.avro' in bucket '<actual underlying bucket name here>': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

JGynther avatar Feb 20 '24 22:02 JGynther

I think the problem is in the PyArrow S3FileSystem; I found this open issue https://issues.apache.org/jira/browse/ARROW-9669 (moved to https://github.com/apache/arrow/issues/25727).

Regarding "it's able to query S3 for the metadata just fine and create the table object": that step works because it goes through the fsspec file system.

Let me check whether it's a bug in PyArrowFile or something we need to fix in PyArrow.
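
As a possible interim workaround (untested here), PyIceberg can be forced to use the fsspec-based FileIO for all reads via the py-io-impl property, keeping the manifest and data reads off the PyArrow S3FileSystem entirely; whether s3fs then handles access point aliases transparently is an assumption to verify:

from pyiceberg.table import StaticTable

# "py-io-impl" selects the FileIO implementation; FsspecFileIO routes S3 reads
# through s3fs instead of PyArrow. Alias and path are placeholders as above.
table = StaticTable.from_metadata(
    "s3://<name>-<number>-s3alias/path/to/metadata.json",
    {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"},
)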

hussein-awala avatar Feb 20 '24 22:02 hussein-awala

I found the same issue. However, reading via an S3 Access Point alias seems to work just fine. Here is a minimal PyArrow example that reads a file:

from pyarrow import fs

object = "iceberg/metadata/snap-8568433207443008232-1-e58de399-3ee1-45a6-b4df-a7c0e33bccbd.avro"
uri = f"<name>-<number>-s3alias/{object}"

s3 = fs.S3FileSystem(region="eu-north-1")
with s3.open_input_file(uri) as file:
    print(file.readall())

I believe this is because access point aliases use the same endpoint as normal S3 operations, so the PyArrow issue linked above does not affect reads via an alias. Compare e.g. blue-access-point-1111222223333.s3-accesspoint.us-east-2.amazonaws.com (access point) vs. blue-access-point-razthp3ehn-s3alias.s3.us-east-2.amazonaws.com (alias) and example-bucket.s3.us-east-2.amazonaws.com (bucket).

Edit: more details

JGynther avatar Feb 20 '24 22:02 JGynther

Thinking about this further, I see two options:

  1. Change the way metadata is read when the target is an access point alias, either by checking for the "-s3alias" suffix (which is reserved, so a match is guaranteed to be an access point alias; see the sketch below) or through an explicit flag like overrideS3Bucket
  2. Add an explicit flag when reading a table to override the bucket name used for the actual S3FileSystem calls

I'll see if I can make a quick-and-dirty solution work, or whether there are additional caveats beyond simply replacing the bucket name.
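
A rough sketch of the suffix check from option 1 (the helper name is made up):

def is_access_point_alias(bucket: str) -> bool:
    # "-s3alias" is a reserved suffix for S3 Access Point aliases, so a match
    # reliably identifies an alias rather than a regular bucket.
    return bucket.endswith("-s3alias")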

JGynther avatar Feb 21 '24 10:02 JGynther

Finally had a chance to poke at this.

It seems to me there is no easy way to implement this. When creating and scanning a StaticTable, the actual location of a particular file is derived from metadata at least a few times: during the initial read of the metadata, and again when manifest lists are turned into ManifestEntry objects for the data scan. It is not enough to just replace the locations while/after reading the initial metadata.

A reasonable place to implement this might be the actual FileIO, with parameters similar to those it already accepts for other settings. This would not work out of the box either, as the PyArrow S3FileSystem does not support replacing the bucket name.

It could work by creating a light wrapper around the S3FileSystem that replaces the bucket name of incoming paths based on a mapping like ("examplebucketname1", "replacedname-s3alias"). Of course, the question then is whether this should instead be a feature request on the PyArrow side.

Another option would be decoupling the filename/key from the location by respecting e.g. the metadata location parameter, but that would require changing a lot and is probably not a good approach.
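
To make the FileIO idea a bit more concrete, here is a minimal sketch of a custom FileIO that rewrites bucket names before delegating to PyArrowFileIO. The property name s3.bucket-mapping and its comma-separated format are made up, and this is illustrative only, not a working implementation:

from pyiceberg.io.pyarrow import PyArrowFileIO

class BucketMappingFileIO(PyArrowFileIO):
    def __init__(self, properties=None):
        super().__init__(properties or {})
        # Hypothetical format: "realbucket1=alias1,realbucket2=alias2"
        raw = self.properties.get("s3.bucket-mapping", "")
        self._mapping = dict(pair.split("=", 1) for pair in raw.split(",") if pair)

    def _rewrite(self, location: str) -> str:
        # Swap the real bucket for its access point alias in s3:// locations
        for bucket, alias in self._mapping.items():
            location = location.replace(f"s3://{bucket}/", f"s3://{alias}/", 1)
        return location

    def new_input(self, location: str):
        return super().new_input(self._rewrite(location))

    def new_output(self, location: str):
        return super().new_output(self._rewrite(location))

Such a class could in principle be selected through the py-io-impl property, though as noted the scan path would still need the mapping applied wherever PyArrow builds its own paths.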

JGynther avatar Mar 04 '24 06:03 JGynther

Testing a very simple wrapper like:

from pyarrow.fs import S3FileSystem

class WrappedS3FileSystem(S3FileSystem):
    def __init__(self, bucket_override, **kwargs):
        super().__init__(**kwargs)
        # List of (actual_bucket, access_point_alias) pairs
        self.override = bucket_override

    def open_input_file(self, path):
        # Rewrite the real bucket name to its access point alias before
        # delegating to the underlying PyArrow filesystem
        for bucket in self.override:
            path = path.replace(bucket[0], bucket[1], 1)

        return super().open_input_file(path)

Configured like so:

table = StaticTable.from_metadata(
    "s3://accesspoint-number-s3alias/path/to/table",
    {
        "s3.bucket_override": [
            (
                "actualbucketnamehere",
                "accesspoint-number-s3alias",
            )
        ],
    },
)

This allows StaticTable.scan to properly create the DataScan object. Querying the data through any of the methods that use to_arrow still fails, however, because that path goes through the PyArrow Dataset scanner rather than the overridden open_input_file. One could handle this manually from DataScan.plan_files.
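
For experimentation, the path rewriting could also be applied after planning, assuming FileScanTask exposes the data file location as task.file.file_path (which it does in current PyIceberg, as far as I can tell). The mapping reuses the placeholder names from the config above:

mapping = [("actualbucketnamehere", "accesspoint-number-s3alias")]

for task in table.scan(limit=100).plan_files():
    path = task.file.file_path
    for bucket, alias in mapping:
        path = path.replace(f"s3://{bucket}/", f"s3://{alias}/", 1)
    # feed the rewritten paths to a reader that talks to the alias, e.g. pyarrow.parquet
    print(path)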

JGynther avatar Mar 06 '24 12:03 JGynther

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Sep 03 '24 00:09 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Sep 18 '24 00:09 github-actions[bot]