Discussion: Design of `TableMetadataBuilder`.
Problem statement
TableMetadataBuilder is useful in modifying/creating TableMetadata, and is a core data structure of transaction api. There are already some efforts to create one using derived builder, see https://github.com/apache/iceberg-rust/pull/62 and https://github.com/apache/iceberg-rust/pull/229. While it's easy to implement, I have some concerns about this approach. There are some problems with this approach:
- Not easy to review. it's quite easy to forget that we have changed builder api when we do modification to
TableMetadata. - Not easy to customize. I've checked the builder in java api, and it has a lot of customization in each method with validations.
The reason I want to be careful with TableMetadataBuilder is that it's a core data structure in iceberg, and we should not expose apis which be misused to construct inconsistent TableMetadata.
Proposal
I want to propose to implement TableMetadataBuilder in a similar approach like java api, rather than using derived builder directly, such as following:
pub struct TableMetadataBuilder(TableMetadata);
impl TableMetadataBuilder {
pub fn add_schema(mut self, schema: Schema) -> Result<Self> {
// Check if schema validity
}
pub fn add_unbound_partition_spec(mut self, spec: UnboundPartitionSpec) -> Result<Self> {
// Binding partition spec
}
pub fn build(self) -> Result<TableMetadata> {}
}
cc @Xuanwo @ZENOTME @Fokko @JanKaul @y0psolo
When add_schema has been called, both schemas and current_schema_id will be updated? The same question to partition_specs.
When
add_schemahas been called, bothschemasandcurrent_schema_idwill be updated? The same question topartition_specs.
Yes, the add_schema process consists of several steps as in java implementation
- Check if already existed same schema.
- If not, assign schema id to it.
- Modify schemas.
Similar things happens for partition_specs.
Other methods such as setBranchSnapshot are also quite complicated.
I agree with @liurenjie1024. Looking at the Java implementation, it seems like there is too much going on here for a derived builder to be a good fit, and a hand-rolled builder would be more appropriate.
I want to propose to implement TableMetadataBuilder in a similar approach like java api, rather than using derived builder directly
Implementing the builder manually looks good to me.
Cool, seems we reached consensus on this, let's move on to create TableMetadataBuilder.
what about supporting FormatVersion V1 and V2? Will we simply model V2 and provide all optional fields as well for V1 in order to simplify the struct TableMetadata (like it is right now?) - or would there be any reason to change that?
Also how would we handle the update or modifications to an existing TableMetadata. Using TableMetadataBuilder for the controlled construction and implement the update methods directly on the TableMetadata could be one approach?
what about supporting FormatVersion V1 and V2? Will we simply model V2 and provide all optional fields as well for V1 in order to simplify the struct TableMetadata (like it is right now?) - or would there be any reason to change that?
Sorry, I don't get your point. I think currently our approach is just like what you say, and we don't have plan to change it. Oh, you mean further modification to TableMetadata ? In fact, I mean the evolution of table metadata spec in future.
Also how would we handle the update or modifications to an existing TableMetadata. Using TableMetadataBuilder for the controlled construction and implement the update methods directly on the TableMetadata could be one approach?
That's possible, but that means we need to implement the validation logic in two places, and I think we should avoid such duplication.
[...] and we don't have plan to change it
That's possible, but that means we need to implement the validation logic in two places
Makes sense to me. thanks for clarifying.
@liurenjie1024, @Fokko do you know if someone is working on a full implementation of the TableMetadataBuilder?
The question is if we need it similar to java with all that's being implemented as part of transactions.
If we need it, and nobody is working on it, I would prepare a PR a cleaned-up Version of what we have in the Iceberg-Catalog over to here. It's mostly about adding all modifications as addition of snapshots, schema evolution, partitioning, ordering, various validations, partition binding, etc.
It would mainly contain the logic in this class: https://github.com/apache/iceberg/blob/604b2bb85ccd33a132e0f588fad1611ab655e2c5/core/src/main/java/org/apache/iceberg/TableMetadata.java#L863 And a few associated classes.
@liurenjie1024, @Fokko do you know if someone is working on a full implementation of the TableMetadataBuilder?
The question is if we need it similar to java with all that's being implemented as part of transactions.
If we need it, and nobody is working on it, I would prepare a PR a cleaned-up Version of what we have in the Iceberg-Catalog over to here. It's mostly about adding all modifications as addition of snapshots, schema evolution, partitioning, ordering, various validations, partition binding, etc.
It would mainly contain the logic in this class: https://github.com/apache/iceberg/blob/604b2bb85ccd33a132e0f588fad1611ab655e2c5/core/src/main/java/org/apache/iceberg/TableMetadata.java#L863 And a few associated classes.
Hi, @c-thiel AFAIK, nobody is working on this. The plan was to add it one by one when we implement transaction api. It would be greate if you are interested in contributiong.
@liurenjie1024 I would be interested to contribute.
Let me maybe start with partition binding - i.e. binding an UnboundPartitionSpec to a schema - making it a bound PartitionSpec or re-binding to a different schema on updates.
Do you have a strong feeling of where the code should go? As part of an Action or separately? If not, I would just make a proposal.
@liurenjie1024 I would be interested to contribute.
Let me maybe start with partition binding - i.e. binding an
UnboundPartitionSpecto a schema - making it a boundPartitionSpecor re-binding to a different schema on updates.Do you have a strong feeling of where the code should go? As part of an
Actionor separately? If not, I would just make a proposal.
Hi, @c-thiel Thanks for your interest. Typical I think small prs would be easier to review and test. That's to say, a prefered approach would be to implement apis for TableMetadataBuilder in one pr, and add one Action with another pr.
Minor nit, putting all actions in one transaction.rs file maybe too large to maintain, and spltting transaction module into files will be easier to maintain.