Add support for creating an Iceberg table from existing table content
Use case
The content directory (data & metadata) corresponding to an Iceberg table exists on the object storage, but the table has been removed from the metastore. Offer a way to recreate the table.
Existing workaround (sketched below):
- create the table in Hive as an external table and point the `external_location` to the `data` directory of the table (I'm not sure if this plays well with Iceberg delete files)
- execute a CTAS statement to create a new Iceberg table
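For concreteness, a minimal sketch of that workaround. All table names, paths, and columns here are hypothetical, and the column list has to match the actual schema of the data files:

```sql
-- Step 1: expose the existing data files through the Hive connector.
CREATE TABLE hive.default.my_table_raw (
    id bigint,
    value varchar
)
WITH (
    external_location = 's3://my-bucket/my-table/data',
    format = 'PARQUET'
);

-- Step 2: copy everything into a brand-new Iceberg table.
-- All content is rewritten and the original table history is lost.
CREATE TABLE iceberg.default.my_table AS
SELECT * FROM hive.default.my_table_raw;
```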
The outcome of this workaround is that all the existing content of the table gets copied to the newly created Iceberg table, and that the new Iceberg table lacks any history information.
Request
Provide a way to create an Iceberg table in Trino from existing content.
Feedback from @electrum:
Maybe need an explicit external flag (to point out in the `CREATE TABLE` statement that we're creating the table out of existing content).
Related Iceberg API: https://github.com/apache/iceberg/pull/3851
I'm sure you know this, but this is how you add an existing Delta table. It's a very nice feature to have. https://trino.io/docs/current/connector/delta-lake.html#creating-tables
> create the table in Hive as an external table and point the `external_location` to the `data` directory of the table (I'm not sure if this plays well with Iceberg delete files)
This works if you run `expire_snapshots` and `remove_orphan_files` with an expiration window of 0s first.
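For reference, with the Iceberg connector that cleanup would look roughly like the following (the table name is hypothetical, and the connector's minimum-retention settings normally have to be lowered before a `0s` threshold is accepted):

```sql
-- Drop all snapshots except the current one, then delete files
-- that no remaining snapshot references.
ALTER TABLE iceberg.default.my_table EXECUTE expire_snapshots(retention_threshold => '0s');
ALTER TABLE iceberg.default.my_table EXECUTE remove_orphan_files(retention_threshold => '0s');
```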
One case we may need to handle that's different from Delta is that there are situations where we won't be able to derive everything we need to know from the file system layer. That's why the Iceberg API also takes `metadataFileLocation` as a parameter. For example, let's say the last snapshot in the table is #6, and two writers tried committing 6 at the same time. It's valid for the writer that loses the commit to leave a #6 snapshot file around, expecting it to get cleaned up later by GC.
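To illustrate (paths and names are hypothetical): after such a race, the metadata directory can contain two files carrying the same sequence number, so picking "the latest" by file name alone is ambiguous:

```
s3://my-bucket/my-table/metadata/00005-a1b2c3d4.metadata.json
s3://my-bucket/my-table/metadata/00006-e5f6a7b8.metadata.json   <- won the commit
s3://my-bucket/my-table/metadata/00006-c9d0e1f2.metadata.json   <- lost the commit, awaiting GC
```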
> create the table in Hive as an external table and point the `external_location` to the `data` directory of the table (I'm not sure if this plays well with Iceberg delete files)
It won't work for non-delete files either, since the Hive connector won't map columns by ID correctly.
A table needs to be registered with the current metadata file path (and perhaps with the previous one too).
Design question: metadata path
- Should a user be responsible for providing the path to the "current" metadata file?
  - that's more explicit
  - but it's also a burden on the user; they may not actually know it and would just take the last metadata file from the table directory
- Or, should the operation pick what looks like the last metadata file in the table?
  - assuming the files follow the naming convention (with sequence numbers), this would work well, and we may initially fail when some files don't follow the convention
Design question: user's interface in SQL
What should be the SQL statement that does that job?
- e.g. `CREATE TABLE ... WITH (...)` + some new table attribute, like `existing_location`?
- a procedure?
> but it's also a burden on the user; they may not actually know it and would just take the last metadata file from the table directory
To me, this is an administrator-type function and wouldn't be done by someone who doesn't understand the underlying components of Iceberg. It would be very similar to the Delta Lake create table for existing data: `create table (dummy) with (location = <metadata file>);`
> Design question: user's interface in SQL. What should be the SQL statement that does that job?
Some users have complained about the `CREATE TABLE ... WITH (...)` syntax that we used for Delta Lake. Specifically, if you have the path wrong, the operation still succeeds, but you've created a table with one `dummy` column. https://github.com/trinodb/trino/issues/13568
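A sketch of that failure mode, following the Delta Lake registration syntax (names are hypothetical; note the misspelled path):

```sql
-- Intended: register the existing table at 's3://my-bucket/my-table/'.
-- Because the mistyped location contains no transaction log, this
-- silently creates a brand-new table with just the dummy column.
CREATE TABLE delta.default.my_table (dummy bigint)
WITH (location = 's3://my-bucket/my-tabel/');
```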
I have a couple of questions:
- Can two tables point to the same metadata location at the same time?
- If point 1 is true, what happens if we insert data into the old table? Will the newly created table have that new data too?
- What happens if the location provided by the user does not have a metadata file, or the provided location itself is wrong/invalid?
There are a few possible ways to implement this feature from a user's perspective:
- create the table with a specified "directory" location (similar to Delta Lake: https://trino.io/docs/current/connector/delta-lake.html#creating-tables)

```sql
CREATE TABLE iceberg.default.my_table (
    dummy bigint
)
WITH (
    location = 's3://my-bucket/my-table/'
)
```
- create the table with the location of the metadata file to be used for the current version of the table

```sql
CREATE TABLE iceberg.default.my_table (
    dummy bigint
)
WITH (
    table_current_metadata_location = 's3://my-bucket/my-table/metadata/xxx-xxxx-xxxx.json'
)
```

This option comes at the cost of introducing a new parameter for the `CREATE TABLE` statement which can be used only in this rather exotic use case.
- register the new table through a stored procedure (see https://trino.io/docs/current/connector/iceberg.html#rolling-back-to-a-previous-snapshot)

```sql
CALL iceberg.system.register_table('schema-name', 'table-name', metadata-location/table-base-directory-location)
```

This option comes with the freedom to choose (if necessary) meaningful parameter names for the procedure. However, this approach differs from the one used in Delta Lake for creating tables, which may confuse Trino users (see https://trino.io/docs/current/connector/delta-lake.html#creating-tables).
> create the table with the location of the metadata file to be used for the current version of the table
For option 2, the user needs to go and look for the latest metadata file and provide it in the CREATE TABLE statement. But it gives the user the flexibility to choose any (possibly outdated) metadata file (not sure if this could be a valid use case).
For option 2 or 3, we might need to change how a Delta table gets created from existing metadata to keep it in sync with Iceberg.
@electrum @martint @alexjo2144 @phd3 @losipiuk please see https://github.com/trinodb/trino/issues/13552#issuecomment-1250984973 and newer comments
> - Can two tables point to the same metadata location at the same time?
> - If point 1 is true, what happens if we insert data into the old table? Will the newly created table have that new data too?
> - What happens if the location provided by the user does not have a metadata file, or the provided location itself is wrong/invalid?
[Conclusion]
- Point 1 -> Yes, it could be
- Point 2 -> The newly created table won't have the new data
- Point 3 -> Throw an exception
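To illustrate points 1 and 2 with the procedure-style interface from option 3 (table names, paths, and schema are hypothetical):

```sql
-- Register the same existing metadata under two names.
CALL iceberg.system.register_table('default', 'orders_a', 's3://my-bucket/orders');
CALL iceberg.system.register_table('default', 'orders_b', 's3://my-bucket/orders');

-- Inserting into orders_a writes a new metadata file and advances only
-- orders_a's pointer; orders_b still references the old metadata file,
-- so it does not see the new rows.
INSERT INTO iceberg.default.orders_a VALUES (1);
```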
> register the new table through a stored procedure
I'm strongly in favor of this option, mostly because it makes registering an existing table with the catalog unambiguously different from creating a new table at a specific location. This is the problem users have had with Delta: a small typo in the path definition does not result in a failure but instead has unexpected consequences.
The other big one is that it makes it easy to add more parameters to the register procedure if we need to later. For example, if we can't programmatically decide what the most recent snapshot file is, a user could provide it. That is much easier to do with a procedure.
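For instance, the procedure could later grow an optional argument that pins the exact metadata file; parameter names here are only a guess at this point:

```sql
CALL iceberg.system.register_table(
    schema_name => 'default',
    table_name => 'my_table',
    table_location => 's3://my-bucket/my-table',
    -- optional: specify the exact metadata file when it cannot be inferred
    metadata_file_name => '00006-e5f6a7b8.metadata.json');
```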
I agree with @alexjo2144 on this. The behavior is entirely different and none of the properties are relevant, so reusing `CREATE TABLE` doesn't make sense.
I agree with @alexjo2144 too. Seems we have agreement, awesome.
Let's call this new procedure: `iceberg.system.register_table`.
Thanks @alexjo2144 | @electrum | @findepi, I will proceed with the new procedure: `iceberg.system.register_table`.
https://github.com/trinodb/trino/pull/14489 shows how useful this feature is.
Without `register_table` being added in https://github.com/trinodb/trino/pull/14375, it would be pretty much impossible to verify query plans for large datasets on CI.