iceberg-rust icon indicating copy to clipboard operation
iceberg-rust copied to clipboard

Schema: Allow field name `foo.bar` even if struct foo->bar is present

Open c-thiel opened this issue 1 year ago • 5 comments

Currently due to the way name-to-id is build, we cannot have points in columnames if it collides with a struct.

The following schema fails to build:

        let schema = Schema::builder()
            .with_schema_id(1)
            .with_fields(vec![
                NestedField::required(1, "foo", Primitive(PrimitiveType::String)).into(),
                NestedField::required(2, "foo.bar", Primitive(PrimitiveType::Int)).into(),
                NestedField::required(
                    3,
                    "foo",
                    Type::Struct(StructType::new(vec![NestedField::required(
                        4,
                        "bar",
                        Primitive(PrimitiveType::Int),
                    )
                    .into()])),
                )
                .into(),
            ])
            .build()
            .unwrap();

By prohibiting this we follow the Java implementation. There is nothing in the iceberg spec that prohibits these names, so I think we should allow them. For column names as a user I expect the same behaviors as for namespaces or databases with points - it needs to be escaped. As escaping depends on the query engine, and iceberg-rust as well as iceberg java has no SQL-parser, those libraries should not take away the option from the engine or make that decision in their stead.

I propose to change the representation of a colname to Vec<String> instead of just "String with points". It would also make accessor compatible between schemas - even if we decide to stick keep this artificial restriction.

c-thiel avatar Aug 29 '24 07:08 c-thiel

CC @Fokko , @nastra , @Xuanwo

c-thiel avatar Aug 29 '24 07:08 c-thiel

Allowing such a case would also require a change on the table scan api, e.g. we need a way to allow user to tell use what foo.bar actually means.

liurenjie1024 avatar Aug 29 '24 08:08 liurenjie1024

Also cc @rdblue

liurenjie1024 avatar Aug 29 '24 10:08 liurenjie1024

FWIW. Can we introduce a structure similar to TableIdent to represent field paths, named 'FieldPath'?

  1. We could modify the elements of selected TableScan::column_names from string to FieldPath (and provide some utilities to simplify the construction of FieldPath).
  2. Change the key in name_to_id to FieldPath.
  3. Perhaps the logic for looking up fields via FieldPath would also need to be modified.

yukkit avatar Sep 07 '24 09:09 yukkit

This is an implementation detail that is not part of the spec. But I don't think it is worth bothering to enable both ["foo.bar"] and ["foo", "bar"] identifiers in the same schema. That's confusing for users, who will almost certainly be confused by tables with such odd structures.

The reference implementation has been this way for about 7 years and no one has every complained or, to my knowledge, hit a problem with this in practice. I highly recommend focusing time and effort on other improvements.

rdblue avatar Sep 16 '24 18:09 rdblue

This is not so much about this specific use case - which I also don't care about much either, but about having two different representations for the same entity. Took me a while to formulate it so clearly. There are many lines of code in different files just to work around this problem.

Let's assume for example we want to add a field to a schema, then the point representation in the PartitionSpec is not compatible with the one in the Schema. Java would just throw an exception in this case. Because we have two distinct representations for the same entity, and those representations are not bijective, we need extra code to handle conversion.

https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L97-L103 also https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L112 several lines of docs here: https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/api/src/main/java/org/apache/iceberg/UpdateSchema.java#L54-L58

We are even inclined to document at some places that a point is nothing special here:

 * <p>The given name is used to name the new column and names containing "." are not handled
 * differently.

All of this could be solved by using a globaly unique identifier for a field - which for me is very clearly Vec<String> and just handle conversion to a string only representation at the last moment.

I stumbled across this again when implementing SchemaUpdate. So lets forget about the use-case, but maybe still consider harmonizing identifiers for a field.

c-thiel avatar Nov 13 '24 16:11 c-thiel

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Sep 30 '25 00:09 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Oct 15 '25 00:10 github-actions[bot]