deephaven-core

feat: Direct iceberg table reading

Open devinrsmith opened this issue 1 year ago • 2 comments

This adds support for reading Iceberg tables directly from a specific metadata file, without needing a catalog (although a catalog may still be present).

At a minimum, this should be a very useful tool for debugging Iceberg issues. In some cases, it may be the best way to read Iceberg data, since a catalog may not be supported. For example, the ClickHouse Iceberg integration uses direct access without catalog support (no table name, no namespace, etc.):

$ aws s3 ls --recursive --no-sign-request s3://datasets-documentation/ookla/iceberg/
2024-01-22 07:46:40          0 ookla/iceberg/
2024-01-22 08:48:38  611058150 ookla/iceberg/data/7XNeNQ/year_month_year=2019/20240122_164644_00156_m96dt-a29b72df-0432-46db-8194-7cb911f08800.parquet
2024-01-22 08:47:14  756200550 ookla/iceberg/data/87-7xw/year_month_year=2020/20240122_164644_00156_m96dt-63c79f23-dd64-4a7e-890b-700b722b5d03.parquet
2024-01-22 08:47:07  767259012 ookla/iceberg/data/Mmyt8A/year_month_year=2021/20240122_164644_00156_m96dt-bb51598a-6f86-4d19-9717-bcef163a4f05.parquet
2024-01-22 08:47:01  781589111 ookla/iceberg/data/X9Wyog/year_month_year=2022/20240122_164644_00156_m96dt-d27f1b84-bb50-4205-b323-990a32e18ff6.parquet
2024-01-22 08:47:01  836014231 ookla/iceberg/data/wRhLaA/year_month_year=2023/20240122_164644_00156_m96dt-e5ce13ef-f40c-4834-b032-39e403121d3c.parquet
2024-01-22 08:25:30       1968 ookla/iceberg/metadata/00000-6bfbd5a5-c431-4a41-98c8-12328da25947.metadata.json
2024-01-22 08:49:55       3107 ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json
2024-01-22 08:49:53       8347 ookla/iceberg/metadata/a3a81488-f4ec-42ad-9819-54527e7f6385-m0.avro
2024-01-22 08:49:54       4280 ookla/iceberg/metadata/snap-8326954415243093563-1-a3a81488-f4ec-42ad-9819-54527e7f6385.avro

SELECT
  *
FROM
  iceberg('https://datasets-documentation.s3.eu-west-3.amazonaws.com/ookla/iceberg/')

https://clickhouse.com/blog/exploring-global-internet-speeds-with-apache-iceberg-clickhouse
https://clickhouse.com/docs/en/sql-reference/table-functions/iceberg

With this PR, the equivalent in Deephaven would be:

from deephaven.experimental import iceberg

ookla = iceberg.read_static_table(
    "s3://datasets-documentation/ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json"
)

For ease of use, this slightly more verbose version works out of the box without relying on implicit AWS credentials:

from deephaven.experimental import iceberg, s3
from datetime import timedelta

ookla = iceberg.read_static_table(
    "s3://datasets-documentation/ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json",
    instructions=iceberg.IcebergInstructions(
        data_instructions=s3.S3Instructions(
            region_name="eu-west-3",
            anonymous_access=True,
            read_timeout=timedelta(seconds=10),
        )
    ),
)
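The `region_name` and bucket used above can be read off the HTTPS URL in the ClickHouse query, which follows the S3 virtual-hosted-style layout `https://<bucket>.s3.<region>.amazonaws.com/<key>`. The sketch below shows that mapping with only the standard library; the helper name `split_virtual_hosted_s3_url` is hypothetical and is not part of this PR or of Deephaven.

```python
from urllib.parse import urlparse


def split_virtual_hosted_s3_url(url: str) -> tuple[str, str, str]:
    """Split https://<bucket>.s3.<region>.amazonaws.com/<key> into (bucket, region, key).

    Hypothetical helper for illustration only; it handles just this one URL layout.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    suffix = ".amazonaws.com"
    if not host.endswith(suffix):
        raise ValueError(f"not an amazonaws.com host: {host!r}")
    # What remains is "<bucket>.s3.<region>"; split on the last ".s3." marker.
    bucket, sep, region = host[: -len(suffix)].rpartition(".s3.")
    if not sep or not bucket or not region:
        raise ValueError(f"not a virtual-hosted-style S3 URL: {url!r}")
    return bucket, region, parsed.path.lstrip("/")


bucket, region, key = split_virtual_hosted_s3_url(
    "https://datasets-documentation.s3.eu-west-3.amazonaws.com/ookla/iceberg/"
)
s3_uri = f"s3://{bucket}/{key}"
```

Here `region` is the value passed as `region_name`, and `s3_uri` is the table root under which the `metadata/` directory lives.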

There's potential to extend this support to accept the root of the table location (as ClickHouse does) rather than a specific metadata file, i.e., s3://datasets-documentation/ookla/iceberg/, but that would take some additional logic.
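As a rough idea of what that additional logic might involve, the sketch below picks a metadata file from a listing of object keys under the table root. It is not what this PR implements. It assumes the `NNNNN-<uuid>.metadata.json` naming seen in the listing above and assumes the highest numeric prefix is the current version; other writers may use different names or a separate version hint file, so a real implementation would need more care. The function name `latest_metadata_file` is hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches keys such as ".../metadata/00001-<uuid>.metadata.json" and captures the version prefix.
_METADATA_RE = re.compile(r"(?:^|/)(\d+)-[^/]+\.metadata\.json$")


def latest_metadata_file(keys: Iterable[str]) -> Optional[str]:
    """Return the key with the highest numeric version prefix, or None if nothing matches."""
    best_key = None
    best_version = -1
    for key in keys:
        match = _METADATA_RE.search(key)
        if match is None:
            continue  # manifests, snapshot files, and data files are skipped
        version = int(match.group(1))
        if version > best_version:
            best_version, best_key = version, key
    return best_key


# Keys copied from the `aws s3 ls` listing above.
keys = [
    "ookla/iceberg/metadata/00000-6bfbd5a5-c431-4a41-98c8-12328da25947.metadata.json",
    "ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json",
    "ookla/iceberg/metadata/a3a81488-f4ec-42ad-9819-54527e7f6385-m0.avro",
    "ookla/iceberg/metadata/snap-8326954415243093563-1-a3a81488-f4ec-42ad-9819-54527e7f6385.avro",
]
latest = latest_metadata_file(keys)
```

For this listing, `latest` is the `00001-…` file, the same one passed to `read_static_table` in the examples above.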

devinrsmith • Aug 01 '24 00:08

Documentation is intentionally missing for now, as I want to make sure there's some agreement on the interfaces before proceeding.

devinrsmith • Aug 01 '24 00:08

This is partially related to #5868, at least for providing a refactoring of the TableDefinition logic and exposing it to end users for the static entrypoints.

devinrsmith • Aug 01 '24 01:08