Problem statement

TableMetadataBuilder is useful in modifying/creating TableMetadata, and is a core data structure of transaction api. There are already some efforts to create one using derived builder, see https://github.com/apache/iceberg-rust/pull/62 and https://github.com/apache/iceberg-rust/pull/229. While it's easy to implement, I have some concerns about this approach. There are some problems with this approach:

Not easy to review. it's quite easy to forget that we have changed builder api when we do modification to TableMetadata.
Not easy to customize. I've checked the builder in java api, and it has a lot of customization in each method with validations.

The reason I want to be careful with TableMetadataBuilder is that it's a core data structure in iceberg, and we should not expose apis which be misused to construct inconsistent TableMetadata.

Proposal

I want to propose to implement TableMetadataBuilder in a similar approach like java api, rather than using derived builder directly, such as following:

pub struct TableMetadataBuilder(TableMetadata);

impl TableMetadataBuilder {
   pub fn add_schema(mut self, schema: Schema) -> Result<Self> {
       // Check if schema validity
   }

  pub fn add_unbound_partition_spec(mut self, spec: UnboundPartitionSpec) -> Result<Self> { 
   // Binding partition spec
  }

  pub fn build(self) -> Result<TableMetadata> {}
}

Mar 07 '24 01:03 liurenjie1024

cc @Xuanwo @ZENOTME @Fokko @JanKaul @y0psolo

Mar 07 '24 01:03 liurenjie1024

When add_schema has been called, both schemas and current_schema_id will be updated? The same question to partition_specs.

Mar 07 '24 02:03 Xuanwo

When add_schema has been called, both schemas and current_schema_id will be updated? The same question to partition_specs.

Yes, the add_schema process consists of several steps as in java implementation

Check if already existed same schema.
If not, assign schema id to it.
Modify schemas.

Similar things happens for partition_specs.

Other methods such as setBranchSnapshot are also quite complicated.

Mar 07 '24 02:03 liurenjie1024

I agree with @liurenjie1024. Looking at the Java implementation, it seems like there is too much going on here for a derived builder to be a good fit, and a hand-rolled builder would be more appropriate.

Mar 07 '24 08:03 sdd

I want to propose to implement TableMetadataBuilder in a similar approach like java api, rather than using derived builder directly

Implementing the builder manually looks good to me.

Mar 08 '24 03:03 ZENOTME

Cool, seems we reached consensus on this, let's move on to create TableMetadataBuilder.

Mar 10 '24 05:03 liurenjie1024

what about supporting FormatVersion V1 and V2? Will we simply model V2 and provide all optional fields as well for V1 in order to simplify the struct TableMetadata (like it is right now?) - or would there be any reason to change that?

Also how would we handle the update or modifications to an existing TableMetadata. Using TableMetadataBuilder for the controlled construction and implement the update methods directly on the TableMetadata could be one approach?

Mar 10 '24 10:03 marvinlanhenke

what about supporting FormatVersion V1 and V2? Will we simply model V2 and provide all optional fields as well for V1 in order to simplify the struct TableMetadata (like it is right now?) - or would there be any reason to change that?

Sorry, I don't get your point. I think currently our approach is just like what you say, and we don't have plan to change it. Oh, you mean further modification to TableMetadata ? In fact, I mean the evolution of table metadata spec in future.

Also how would we handle the update or modifications to an existing TableMetadata. Using TableMetadataBuilder for the controlled construction and implement the update methods directly on the TableMetadata could be one approach?

That's possible, but that means we need to implement the validation logic in two places, and I think we should avoid such duplication.

Mar 10 '24 12:03 liurenjie1024

[...] and we don't have plan to change it

That's possible, but that means we need to implement the validation logic in two places

Makes sense to me. thanks for clarifying.

Mar 10 '24 12:03 marvinlanhenke

@liurenjie1024, @Fokko do you know if someone is working on a full implementation of the TableMetadataBuilder?

The question is if we need it similar to java with all that's being implemented as part of transactions.

If we need it, and nobody is working on it, I would prepare a PR a cleaned-up Version of what we have in the Iceberg-Catalog over to here. It's mostly about adding all modifications as addition of snapshots, schema evolution, partitioning, ordering, various validations, partition binding, etc.

It would mainly contain the logic in this class: https://github.com/apache/iceberg/blob/604b2bb85ccd33a132e0f588fad1611ab655e2c5/core/src/main/java/org/apache/iceberg/TableMetadata.java#L863 And a few associated classes.

Jul 23 '24 09:07 c-thiel

@liurenjie1024, @Fokko do you know if someone is working on a full implementation of the TableMetadataBuilder?

The question is if we need it similar to java with all that's being implemented as part of transactions.

If we need it, and nobody is working on it, I would prepare a PR a cleaned-up Version of what we have in the Iceberg-Catalog over to here. It's mostly about adding all modifications as addition of snapshots, schema evolution, partitioning, ordering, various validations, partition binding, etc.

It would mainly contain the logic in this class: https://github.com/apache/iceberg/blob/604b2bb85ccd33a132e0f588fad1611ab655e2c5/core/src/main/java/org/apache/iceberg/TableMetadata.java#L863 And a few associated classes.

Hi, @c-thiel AFAIK, nobody is working on this. The plan was to add it one by one when we implement transaction api. It would be greate if you are interested in contributiong.

Jul 23 '24 13:07 liurenjie1024

@liurenjie1024 I would be interested to contribute.

Let me maybe start with partition binding - i.e. binding an UnboundPartitionSpec to a schema - making it a bound PartitionSpec or re-binding to a different schema on updates.

Do you have a strong feeling of where the code should go? As part of an Action or separately? If not, I would just make a proposal.

Jul 24 '24 08:07 c-thiel

@liurenjie1024 I would be interested to contribute.

Let me maybe start with partition binding - i.e. binding an UnboundPartitionSpec to a schema - making it a bound PartitionSpec or re-binding to a different schema on updates.

Do you have a strong feeling of where the code should go? As part of an Action or separately? If not, I would just make a proposal.

Hi, @c-thiel Thanks for your interest. Typical I think small prs would be easier to review and test. That's to say, a prefered approach would be to implement apis for TableMetadataBuilder in one pr, and add one Action with another pr.

Minor nit, putting all actions in one transaction.rs file maybe too large to maintain, and spltting transaction module into files will be easier to maintain.

Jul 24 '24 12:07 liurenjie1024

Discussion: Design of `TableMetadataBuilder`.

Problem statement

Proposal