feat: Direct iceberg table reading
This adds support for reading Iceberg tables directly from a specific metadata file, without the need for a catalog (although a catalog may still be present).
At a minimum, this should be a very useful tool for debugging Iceberg issues. In some cases it may be the best way to read Iceberg data, since a catalog may not be supported. For example, the ClickHouse Iceberg integration uses direct access without catalog support (no table name, no namespace, etc.):
$ aws s3 ls --recursive --no-sign-request s3://datasets-documentation/ookla/iceberg/
2024-01-22 07:46:40 0 ookla/iceberg/
2024-01-22 08:48:38 611058150 ookla/iceberg/data/7XNeNQ/year_month_year=2019/20240122_164644_00156_m96dt-a29b72df-0432-46db-8194-7cb911f08800.parquet
2024-01-22 08:47:14 756200550 ookla/iceberg/data/87-7xw/year_month_year=2020/20240122_164644_00156_m96dt-63c79f23-dd64-4a7e-890b-700b722b5d03.parquet
2024-01-22 08:47:07 767259012 ookla/iceberg/data/Mmyt8A/year_month_year=2021/20240122_164644_00156_m96dt-bb51598a-6f86-4d19-9717-bcef163a4f05.parquet
2024-01-22 08:47:01 781589111 ookla/iceberg/data/X9Wyog/year_month_year=2022/20240122_164644_00156_m96dt-d27f1b84-bb50-4205-b323-990a32e18ff6.parquet
2024-01-22 08:47:01 836014231 ookla/iceberg/data/wRhLaA/year_month_year=2023/20240122_164644_00156_m96dt-e5ce13ef-f40c-4834-b032-39e403121d3c.parquet
2024-01-22 08:25:30 1968 ookla/iceberg/metadata/00000-6bfbd5a5-c431-4a41-98c8-12328da25947.metadata.json
2024-01-22 08:49:55 3107 ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json
2024-01-22 08:49:53 8347 ookla/iceberg/metadata/a3a81488-f4ec-42ad-9819-54527e7f6385-m0.avro
2024-01-22 08:49:54 4280 ookla/iceberg/metadata/snap-8326954415243093563-1-a3a81488-f4ec-42ad-9819-54527e7f6385.avro
SELECT
    *
FROM
    iceberg('https://datasets-documentation.s3.eu-west-3.amazonaws.com/ookla/iceberg/')
https://clickhouse.com/blog/exploring-global-internet-speeds-with-apache-iceberg-clickhouse
https://clickhouse.com/docs/en/sql-reference/table-functions/iceberg
With this PR, the equivalent in Deephaven would be:
from deephaven.experimental import iceberg

ookla = iceberg.read_static_table(
    "s3://datasets-documentation/ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json"
)
For ease of use, this slightly more verbose version works out of the box without relying on implicit AWS credentials:
from deephaven.experimental import iceberg, s3
from datetime import timedelta

ookla = iceberg.read_static_table(
    "s3://datasets-documentation/ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json",
    instructions=iceberg.IcebergInstructions(
        data_instructions=s3.S3Instructions(
            region_name="eu-west-3",
            anonymous_access=True,
            read_timeout=timedelta(seconds=10),
        )
    ),
)
There's potential to extend this support to point at the root of the table location (as ClickHouse supports) rather than at a specific metadata file, i.e., s3://datasets-documentation/ookla/iceberg/, but that would take some additional logic.
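As a rough illustration of what that additional logic might involve, here is a minimal sketch that picks the newest metadata file out of a listing of object keys. It is not part of this PR. The helper name latest_metadata_file is hypothetical, and it assumes the naming convention visible in the listing above, where each metadata file starts with a zero-padded numeric version followed by a UUID. Tables written with a different convention would need different handling.

```python
import re
from typing import Iterable, Optional

# Assumption: metadata files are named "<numeric version>-<uuid>.metadata.json",
# as in the S3 listing above. The group captures the numeric version prefix.
_METADATA_NAME = re.compile(r"(?:^|/)(\d+)-[^/]*\.metadata\.json$")


def latest_metadata_file(keys: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: return the key with the highest version prefix, or None."""
    best_key = None
    best_version = -1
    for key in keys:
        match = _METADATA_NAME.search(key)
        if match is None:
            continue  # skip anything that is not a versioned metadata.json file
        version = int(match.group(1))
        if version > best_version:
            best_key, best_version = key, version
    return best_key


# Keys copied from the metadata/ directory in the listing above.
keys = [
    "ookla/iceberg/metadata/00000-6bfbd5a5-c431-4a41-98c8-12328da25947.metadata.json",
    "ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json",
    "ookla/iceberg/metadata/a3a81488-f4ec-42ad-9819-54527e7f6385-m0.avro",
    "ookla/iceberg/metadata/snap-8326954415243093563-1-a3a81488-f4ec-42ad-9819-54527e7f6385.avro",
]
latest = latest_metadata_file(keys)
```

Here latest is the 00001 metadata file, which is the same file passed to read_static_table in the examples above. A real implementation would also need to list the objects under the table root and decide what to do when no metadata file is found.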
This is missing documentation, as I want to make sure there's some agreement on the interfaces before proceeding.
This is partially related to #5868, at least in that it refactors the TableDefinition logic and exposes it to end users for the static entrypoints.