Add support for creating an Iceberg table from existing table content
Use case
The content directory (data & metadata) corresponding to an Iceberg table exists on the object storage, but the table has been removed from the metastore. Offer a way to recreate the table.
Existing workaround (sketched below):
- create the table in Hive as an external table and point the `external_location` to the `data` directory of the table (I'm not sure if this plays well with Iceberg delete files)
- execute a CTAS statement to create a new Iceberg table
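For concreteness, a minimal sketch of that workaround. All table names, paths, and columns here are hypothetical, and the column list has to match the actual schema of the data files:

```sql
-- Step 1: expose the existing data files through the Hive connector.
CREATE TABLE hive.default.my_table_raw (
    id bigint,
    value varchar
)
WITH (
    external_location = 's3://my-bucket/my-table/data',
    format = 'PARQUET'
);

-- Step 2: copy everything into a brand-new Iceberg table.
-- All content is rewritten and the original table history is lost.
CREATE TABLE iceberg.default.my_table AS
SELECT * FROM hive.default.my_table_raw;
```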
The outcome of this workaround is that all the existing content of the table gets copied to the newly created Iceberg table, and that the new Iceberg table lacks any history information.
Request
Provide a way to create an Iceberg table in Trino from existing content.
Feedback from @electrum:
Maybe need an explicit external flag (to point out in the `CREATE TABLE` statement that we're creating the table out of existing content).
Related Iceberg API: https://github.com/apache/iceberg/pull/3851
I'm sure you know this, but this is how you add an existing Delta table. It's a very nice feature to have. https://trino.io/docs/current/connector/delta-lake.html#creating-tables
> create the table in Hive as an external table and point the `external_location` to the `data` directory of the table (I'm not sure if this plays well with Iceberg delete files)
This works if you run `expire_snapshots` and `remove_orphan_files` with an expiration window of 0s first.
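For reference, with the Iceberg connector that cleanup would look roughly like the following (the table name is hypothetical, and the connector's minimum-retention settings normally have to be lowered before a `0s` threshold is accepted):

```sql
-- Drop all snapshots except the current one, then delete files
-- that no remaining snapshot references.
ALTER TABLE iceberg.default.my_table EXECUTE expire_snapshots(retention_threshold => '0s');
ALTER TABLE iceberg.default.my_table EXECUTE remove_orphan_files(retention_threshold => '0s');
```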
One case we may need to handle that's different from Delta is that there are situations where we won't be able to derive everything we need to know from the file system layer. That's why the Iceberg API also takes `metadataFileLocation` as a parameter. For example, let's say the last snapshot in the table is #6, and two writers tried committing 6 at the same time. It's valid for the writer that loses the commit to leave a #6 snapshot file around, expecting it to get cleaned up later by GC.
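To illustrate (paths and names are hypothetical): after such a race, the metadata directory can contain two files carrying the same sequence number, so picking "the latest" by file name alone is ambiguous:

```
s3://my-bucket/my-table/metadata/00005-a1b2c3d4.metadata.json
s3://my-bucket/my-table/metadata/00006-e5f6a7b8.metadata.json   <- won the commit
s3://my-bucket/my-table/metadata/00006-c9d0e1f2.metadata.json   <- lost the commit, awaiting GC
```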
> create the table in Hive as an external table and point the `external_location` to the `data` directory of the table (I'm not sure if this plays well with Iceberg delete files)
It won't work for non-delete files either, since the Hive connector won't map columns by ID correctly.
A table needs to be registered with the current metadata file path (and perhaps with the previous one too).
Design question: metadata path
- Should a user be responsible for providing the path to the "current" metadata file?
  - that's more explicit
  - but it's also a burden on the user; they may not actually know it and would just take the last metadata file from the table directory
- Or, should the operation pick what looks like the last metadata file in the table?
  - assuming the files follow the naming convention (with sequence numbers), this would work well, and we may initially fail when some files don't follow the convention
Design question: user's interface in SQL
What should be the SQL statement that does that job?
- e.g. `CREATE TABLE ... WITH (...)` + some new table attribute, like `existing_location`?
- a procedure?
> but it's also a burden on the user; they may not actually know it and would just take the last metadata file from the table directory
To me, this is an administrator-type function and wouldn't be done by someone who doesn't understand the underlying components of Iceberg. It would be very similar to the Delta Lake create table for existing data: `create table (dummy) with (location = <metadata file>);`
> Design question: user's interface in SQL. What should be the SQL statement that does that job?
Some users have complained about the `CREATE TABLE ... WITH (...)` syntax that we used for Delta Lake. Specifically, if you have the path wrong, the operation still succeeds, but you've created a table with one `dummy` column. https://github.com/trinodb/trino/issues/13568
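A sketch of that failure mode, following the Delta Lake registration syntax (names are hypothetical; note the misspelled path):

```sql
-- Intended: register the existing table at 's3://my-bucket/my-table/'.
-- Because the mistyped location contains no transaction log, this
-- silently creates a brand-new table with just the dummy column.
CREATE TABLE delta.default.my_table (dummy bigint)
WITH (location = 's3://my-bucket/my-tabel/');
```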
I have a couple of questions:
- Can two tables point to the same metadata location at the same time?
- If point 1 is true, what happens if we insert data into the old table? Will the newly created table have that new data too?
- What happens if the location provided by the user does not have a metadata file, or the provided location itself is wrong/invalid?
There are a few possible ways to implement this feature from a user's perspective:
- create the table with a specified "directory" location (similar to Delta Lake: https://trino.io/docs/current/connector/delta-lake.html#creating-tables)

```sql
CREATE TABLE iceberg.default.my_table (
    dummy bigint
)
WITH (
    location = 's3://my-bucket/my-table/'
)
```
- create the table with the location of the metadata file to be used for the current version of the table

```sql
CREATE TABLE iceberg.default.my_table (
    dummy bigint
)
WITH (
    table_current_metadata_location = 's3://my-bucket/my-table/metadata/xxx-xxxx-xxxx.json'
)
```

This option comes at the cost of introducing a new parameter for the `CREATE TABLE` statement which can be used only in this rather exotic use case.
- register the new table through a stored procedure (see https://trino.io/docs/current/connector/iceberg.html#rolling-back-to-a-previous-snapshot)

```sql
CALL iceberg.system.register_table('schema-name', 'table-name', metadata-location/table-base-directory-location)
```

This option comes with the freedom to choose (if necessary) meaningful parameter names for the procedure. However, this approach differs from the one used in Delta Lake for creating tables, which may confuse Trino users (see https://trino.io/docs/current/connector/delta-lake.html#creating-tables).
> create the table with the location of the metadata file to be used for the current version of the table
For option 2, the user needs to go and look for the latest metadata file and provide it in the CREATE TABLE statement. But it gives the user the flexibility to choose any (possibly outdated) metadata file (not sure if this could be a valid use case).
For option 2 or 3, we might need to change how a Delta table gets created from existing metadata to keep it in sync with Iceberg.
@electrum @martint @alexjo2144 @phd3 @losipiuk please see https://github.com/trinodb/trino/issues/13552#issuecomment-1250984973 and newer comments
> - Can two tables point to the same metadata location at the same time?
> - If point 1 is true, what happens if we insert data into the old table? Will the newly created table have that new data too?
> - What happens if the location provided by the user does not have a metadata file, or the provided location itself is wrong/invalid?
[Conclusion]
- Point 1 -> Yes, it could be
- Point 2 -> The newly created table won't have the new data
- Point 3 -> Throw an exception
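To illustrate points 1 and 2 with the procedure-style interface from option 3 (table names, paths, and schema are hypothetical):

```sql
-- Register the same existing metadata under two names.
CALL iceberg.system.register_table('default', 'orders_a', 's3://my-bucket/orders');
CALL iceberg.system.register_table('default', 'orders_b', 's3://my-bucket/orders');

-- Inserting into orders_a writes a new metadata file and advances only
-- orders_a's pointer; orders_b still references the old metadata file,
-- so it does not see the new rows.
INSERT INTO iceberg.default.orders_a VALUES (1);
```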
> register the new table through a stored procedure
I'm strongly in favor of this option, mostly because it makes registering an existing table with the catalog unambiguously different from creating a new table at a specific location. This is the problem users have had with Delta: a small typo in the path definition does not result in a failure but instead has unexpected consequences.
The other big one is that it makes it easy to add more parameters to the register procedure if we need to later. For example, if we can't programmatically decide what the most recent snapshot file is, a user could provide it. That is much easier to do with a procedure.
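For instance, the procedure could later grow an optional argument that pins the exact metadata file; parameter names here are only a guess at this point:

```sql
CALL iceberg.system.register_table(
    schema_name => 'default',
    table_name => 'my_table',
    table_location => 's3://my-bucket/my-table',
    -- optional: specify the exact metadata file when it cannot be inferred
    metadata_file_name => '00006-e5f6a7b8.metadata.json');
```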
I agree with @alexjo2144 on this. The behavior is entirely different and none of the properties are relevant, so reusing `CREATE TABLE` doesn't make sense.
I agree with @alexjo2144 too. Seems we have agreement, awesome.
Let's call this new procedure: `iceberg.system.register_table`.
Thanks @alexjo2144 | @electrum | @findepi, I will proceed with the new procedure: `iceberg.system.register_table`.
https://github.com/trinodb/trino/pull/14489 shows how useful this feature is.
Without `register_table` being added in https://github.com/trinodb/trino/pull/14375, it would be pretty much impossible to verify query plans for large datasets on CI.