Schema: Allow field name `foo.bar` even if struct foo->bar is present
Currently due to the way name-to-id is build, we cannot have points in columnames if it collides with a struct.
The following schema fails to build:
let schema = Schema::builder()
.with_schema_id(1)
.with_fields(vec![
NestedField::required(1, "foo", Primitive(PrimitiveType::String)).into(),
NestedField::required(2, "foo.bar", Primitive(PrimitiveType::Int)).into(),
NestedField::required(
3,
"foo",
Type::Struct(StructType::new(vec![NestedField::required(
4,
"bar",
Primitive(PrimitiveType::Int),
)
.into()])),
)
.into(),
])
.build()
.unwrap();
By prohibiting this we follow the Java implementation. There is nothing in the iceberg spec that prohibits these names, so I think we should allow them. For column names as a user I expect the same behaviors as for namespaces or databases with points - it needs to be escaped. As escaping depends on the query engine, and iceberg-rust as well as iceberg java has no SQL-parser, those libraries should not take away the option from the engine or make that decision in their stead.
I propose to change the representation of a colname to Vec<String> instead of just "String with points". It would also make accessor compatible between schemas - even if we decide to stick keep this artificial restriction.
CC @Fokko , @nastra , @Xuanwo
Allowing such a case would also require a change on the table scan api, e.g. we need a way to allow user to tell use what foo.bar actually means.
Also cc @rdblue
FWIW. Can we introduce a structure similar to TableIdent to represent field paths, named 'FieldPath'?
- We could modify the elements of selected
TableScan::column_namesfromstringtoFieldPath(and provide some utilities to simplify the construction ofFieldPath). - Change the key in
name_to_idtoFieldPath. - Perhaps the logic for looking up fields via
FieldPathwould also need to be modified.
This is an implementation detail that is not part of the spec. But I don't think it is worth bothering to enable both ["foo.bar"] and ["foo", "bar"] identifiers in the same schema. That's confusing for users, who will almost certainly be confused by tables with such odd structures.
The reference implementation has been this way for about 7 years and no one has every complained or, to my knowledge, hit a problem with this in practice. I highly recommend focusing time and effort on other improvements.
This is not so much about this specific use case - which I also don't care about much either, but about having two different representations for the same entity. Took me a while to formulate it so clearly. There are many lines of code in different files just to work around this problem.
Let's assume for example we want to add a field to a schema, then the point representation in the PartitionSpec is not compatible with the one in the Schema. Java would just throw an exception in this case. Because we have two distinct representations for the same entity, and those representations are not bijective, we need extra code to handle conversion.
https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L97-L103 also https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L112 several lines of docs here: https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/api/src/main/java/org/apache/iceberg/UpdateSchema.java#L54-L58
We are even inclined to document at some places that a point is nothing special here:
* <p>The given name is used to name the new column and names containing "." are not handled
* differently.
All of this could be solved by using a globaly unique identifier for a field - which for me is very clearly Vec<String> and just handle conversion to a string only representation at the last moment.
I stumbled across this again when implementing SchemaUpdate. So lets forget about the use-case, but maybe still consider harmonizing identifiers for a field.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'