
Enhance `catalog.create_table` API to enable creation of table with matching `field_ids` to provided Schema

Open sungwy opened this issue 1 year ago • 11 comments

Feature Request / Improvement

Currently, create_table takes a pyiceberg.schema.Schema or a pyarrow.Schema. The API ignores the field_ids provided in the schema and assigns a fresh set of IDs instead.

Not only does this cause confusion for users, because pyiceberg.schema.Schema requires field_ids, but it also means that a table cannot be created with a guarantee that its schema will match the field IDs of the provided schema. This prevents the API from being used for table migrations (discussion on example use case), where a user would want to take the following steps:

  1. catalog.load_table to load the pyiceberg.table.Table of an existing Iceberg table.
  2. Get the pyiceberg.schema.Schema of the loaded table.
  3. Create a new table in the target catalog with catalog.create_table, using the existing table's schema.
  4. Copy the files over into the new table using add_files (this is not possible yet, but there is a discussion that would allow add_files to work on a table whose schema has matching field_ids).

The above procedure will not work unless we enhance create_table so that it can create a new table whose schema matches the field_ids of the provided Schema.
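For illustration, here is a minimal sketch of that flow under the proposed behavior, assuming create_table preserves the provided field_ids and add_files accepts files whose schema matches by field_id (neither holds today; the catalog and table names are illustrative):

# Sketch only: relies on the proposed create_table and add_files behavior.
source_table = source_catalog.load_table("db.events")
schema = source_table.schema()  # pyiceberg.schema.Schema, field_ids included

new_table = target_catalog.create_table("db.events", schema=schema)
data_files = [task.file.file_path for task in source_table.scan().plan_files()]
new_table.add_files(file_paths=data_files)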

One way of addressing this issue would be to support two ways of representing the Iceberg table's Schema:

  1. If a schema without field_id is passed into the API, we should create a new pyiceberg.schema.Schema with newly assigned field_ids to create the table.
  2. If a schema with field_id is passed into the API, we should use the exact same pyiceberg.schema.Schema to create the table.

I discuss a few ideas for achieving this below, each with its own pros and cons:

Create a subclass of pyiceberg.schema.Schema without field_id

This sounds like the best approach, but once explored, we quickly realize that it may be impossible. The main challenge is that pyiceberg.schema.Schema describes its fields using NestedField, a nested structure of pydantic BaseModels with field_id as a required field. So there isn't a way to create a subclass of pyiceberg.schema.Schema without field_id.

Create a variant class of pyiceberg.schema.Schema without field_id

This is a bit different from the above approach: it requires us to make variant classes of pyiceberg.schema.Schema that are not subclassed from it. This is not ideal, because we would have to maintain field_id-less copies of NestedField, StructType, MapType, ListType, and Schema, and write methods to build a field_id'd Schema from its field_id-less variant. It is possible, but it would be hard and messy to maintain.

Make field_id an optional attribute of NestedField and the field_id'd Iceberg types

This would allow us to create a pyiceberg.schema.Schema with or without field_ids. However, it opens a new avenue for bugs in PyIceberg that are currently prevented by NestedField's attributes directly matching those of the REST Catalog spec. With field_id optional, we would need many more validations across the code base to ensure that field_id is set on every nested field within a schema before using it.

Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema

I think this may be the safest approach, and it is user-friendly: pyiceberg.schema.Schema requires field_ids, and pyarrow.Schema does not. pyarrow.Schema is also a completely different class, so users do not expect the field_ids within a pyarrow.Schema to be carried over into the pyiceberg.schema.Schema (although this is an enhancement we could introduce in the future). If and when we introduce new schema representations as alternate inputs to the API, we can evaluate case by case whether to keep the field IDs or assign new ones.

I am personally in favor of the last approach: revert to keeping the field_ids of the provided pyiceberg.schema.Schema if the input is of that type, and if the input schema is a pyarrow.Schema, create a new Schema with freshly assigned IDs. The API's behavior will then feel more consistent with how users actually use it and with what they expect the created table's field_ids to be in each scenario.
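To make that concrete, here is a rough sketch of the dispatch. prepare_schema is a hypothetical name, and I'm assuming _convert_schema_if_needed stays a static helper on Catalog; this is a sketch of the idea, not the implementation:

import pyarrow as pa
from pyiceberg.catalog import Catalog
from pyiceberg.schema import Schema, assign_fresh_schema_ids

def prepare_schema(schema):  # hypothetical helper
    if isinstance(schema, Schema):
        return schema  # PyIceberg Schema: keep the caller's field_ids verbatim
    if isinstance(schema, pa.Schema):
        # PyArrow Schema: convert first, then assign fresh field_ids
        return assign_fresh_schema_ids(Catalog._convert_schema_if_needed(schema))
    raise ValueError(f"Unsupported schema type: {type(schema)}")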

I'd love to hear the thoughts of our community members on this topic before jumping onto an implementation.

sungwy avatar Nov 01 '24 13:11 sungwy

Thanks for the writeup!

Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema

+1 to this

Here's my take: currently, create_table can take either a PyIceberg Schema (pyiceberg.schema.Schema) or a PyArrow Schema (pyarrow.Schema). Inside create_table, a PyArrow Schema is converted to a PyIceberg Schema, and then the field_ids are all reassigned: https://github.com/apache/iceberg-python/blob/d559e53ed1895f947274c23de754b802a3f6c46f/pyiceberg/catalog/rest.py#L573-L576
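Paraphrased, the linked lines do roughly the following (a sketch, not the exact source):

schema = self._convert_schema_if_needed(schema)  # PyArrow -> PyIceberg if needed
fresh_schema = assign_fresh_schema_ids(schema)   # field_ids unconditionally reassigned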

There are 3 ways a user can interact with the create_table API:

  • Passing in a PyIceberg Schema that already has field_ids assigned
  • Passing in a PyIceberg Schema that needs field_ids assignment
  • Passing in a PyArrow Schema

For the first case, we should not reassign the field_ids, but we do today. For the second case, we should assign field_ids, since they are required. For the third case, we should convert to a PyIceberg Schema and assign field_ids.

I think it'll be more user-friendly for the create_table function to accept either

  1. PyIceberg Schema with field_ids already assigned
  2. PyArrow Schema, in which case, we'll do the conversion and field_ids assignment

Today, I think the only way to craft a PyIceberg Schema that still needs field_id assignment is either creating it by hand or using the _convert_schema_if_needed function. We might need a way to check the validity of the field_ids in case the PyIceberg Schema was created by hand; a hypothetical check is sketched below.
And we can always couple _convert_schema_if_needed with assign_fresh_schema_ids to ensure that PyIceberg Schemas always have a valid field_id assignment.
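One possible shape for that validity check, using the existing index_by_id helper (which walks nested structs, lists, and maps):

from pyiceberg.schema import Schema, index_by_id

def has_valid_field_ids(schema: Schema) -> bool:  # hypothetical helper
    # The dict keys are the field_ids of every nested field; a complete check
    # would also verify that no duplicate IDs collapsed into a single entry.
    return all(field_id > 0 for field_id in index_by_id(schema))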

I'm excited about this change. This might also help us streamline create_table with partition_spec and sort_order since there's been an issue with the field_id mismatch.

kevinjqliu avatar Nov 01 '24 18:11 kevinjqliu

Hey @kevinjqliu thanks for the support! Glad to hear we are on the same page regarding this issue.

As with all API changes, I think we would want to deliberate on these options carefully, so I'll keep this issue open for a bit longer for others to opine before moving on to implementation.

sungwy avatar Nov 01 '24 18:11 sungwy

Thanks for bringing this up!

I'm excited about this change. This might also help us streamline create_table with partition_spec and sort_order since there's been an issue with the field_id mismatch.

Indeed :) To provide some historical context: the idea was that PyIceberg was more of a library than an end-user tool, so there would always be a second layer in front of it (like PyArrow).

In general, I think the philosophy should be that people don't have to worry about field IDs; this should be hidden away. Table migration is an advanced situation where you don't want to re-assign the IDs. I can think of two options:

  • Having a flag to not assign fresh IDs
  • Having the ability to pass a custom assigning class, so you can also do other ways of assigning the IDs (pre-order, post-order, etc.); a rough sketch is below.
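Something along these lines, as a purely hypothetical interface:

from typing import Protocol
from pyiceberg.schema import Schema, assign_fresh_schema_ids

class SchemaIdAssigner(Protocol):  # hypothetical hook create_table could accept
    def __call__(self, schema: Schema) -> Schema: ...

def keep_ids(schema: Schema) -> Schema:
    return schema  # no-op assigner, e.g. for table migrations

# The default assigner would remain assign_fresh_schema_ids.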

Keep in mind that the REST catalog also might re-assign IDs (I think the one we use in the integration tests also does this).

Fokko avatar Nov 04 '24 13:11 Fokko

Hi @Fokko , thank you for the suggestions! I just put up this PR to introduce a new flag assign_fresh_ids on create_table and related methods.

Keep in mind that the REST catalog also might re-assign IDs (I think the one we use in the integration tests also does this).

Yes, I think this is a great point to note. I ran into this when setting up the integration tests against the REST Catalog image and was a bit confused. I understand that it is up to the REST Catalog server to decide how it wants to create the table once the request is accepted, but I can't help but find it very counter-intuitive that the REST Catalog takes an ID-assigned Schema, PartitionSpec, and SortOrder and still assigns fresh IDs ☹️

Regardless, I'm excited to introduce this feature, so that we can support more migration workflows through PyIceberg - which is something many users have been hoping to do through Python applications 🚀

sungwy avatar Nov 08 '24 22:11 sungwy

In general, I think the philosophy should be; that people don't have to worry about field IDs, and this should be hidden away.

+1

In the case of the table migration, that's some advanced situation where you don't want to re-assign the IDs

We can have a function to explicitly assign IDs, such as assign_fresh_schema_ids. If an Iceberg Schema is given, its originally assigned IDs will be used. If a PyArrow Schema is given, it will be converted to an Iceberg Schema and have IDs assigned.

@sungwy WDYT of this method instead of adding the assign_fresh_ids flag to the functions? I think it doesn't make sense for the create_table function to have a flag for schema ID assignment.

I guess the difference is: is there a case where I'd want to call create_table with an Iceberg Schema and assign_fresh_ids=False? Can the user just call assign_fresh_schema_ids(iceberg_schema) instead? We don't care about PyArrow Schemas, since they are always converted and assigned IDs.
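In other words, under that proposal (catalog, the identifier, and iceberg_schema are illustrative):

from pyiceberg.schema import assign_fresh_schema_ids

# Explicit opt-in to reassignment, instead of a create_table flag:
catalog.create_table("db.tbl", schema=assign_fresh_schema_ids(iceberg_schema))

# Passing the schema unchanged would preserve its field_ids:
catalog.create_table("db.tbl", schema=iceberg_schema)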

kevinjqliu avatar Nov 12 '24 05:11 kevinjqliu

looks like in new_table_metadata the ID assignment is propagated to the partition_spec and sort_order as well

https://github.com/apache/iceberg-python/pull/1304/files#diff-a7fbd8c6b0564308ca872ba479f0e240ff2c893d15776a14e52b91b48e9ac470R524-R528

kevinjqliu avatar Nov 12 '24 05:11 kevinjqliu

If an Iceberg Schema is given, its originally assigned IDs will be used. If a PyArrow Schema is given, it will be converted to an Iceberg Schema and have IDs assigned.

@kevinjqliu - Yes, I actually feel more inclined to drive this based on the type of the input schema. To me, the fact that the Iceberg Schema is required to have IDs assigned makes a very strong case for us to respect the given field IDs and propagate them to the Iceberg table metadata.

The case where some Iceberg REST Catalog servers do not respect the given field IDs and assign fresh ones actually feels more like an edge case to me, one that we should opine on and course-correct. Ideally, when the IDs are specified on the Schema, they should be respected by the server. Otherwise, the Schema supplied in a createTableRequest should not have field IDs assigned in the request.

This approach aligns with my initial proposal on the issue description ⬆️

I put up the PR with the assign_fresh_ids flag based on @Fokko 's concerns with the above approach. Could I ask for your reviews on the PR, so that maybe we can decide on the approach there? https://github.com/apache/iceberg-python/pull/1304

sungwy avatar Nov 13 '24 17:11 sungwy

Thanks for the context yesterday, I was still noodling on it overnight.

If I understand correctly (and please also share the video of you and @adrianqin; I must have missed it during paternity leave), you're looking for a shallow clone of a table. In this situation, the metadata and manifests are recreated, but the existing data files are re-used to avoid unnecessary heavy lifting. As I mentioned yesterday, this is still a bit tricky, since you might do a delete operation where a data file is dropped that's still referenced by the other table, but that's inherent to a shallow clone.

Instead of mangling the create_table operation, wouldn't it be better to have a clone operation? This would still hide the field IDs from the end-user and give a much cleaner API to work with.

The case where some Iceberg Rest Catalog Servers do not respect the given field IDs and assigns fresh IDs actually feels more like an edge case to me, that we want to opine on and course correct.

The REST catalog follows the philosophy that clients shouldn't have to worry about field IDs. Register table is designed to take an existing metadata file and re-use all of its metadata rather than re-assigning it. But it looks like there is interest in more flavors than just create and register. A similar question was also raised on Slack: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1732441495465079

Fokko avatar Nov 27 '24 07:11 Fokko

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar May 27 '25 00:05 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Jun 10 '25 00:06 github-actions[bot]

I just ran into this issue as well, and it was extremely confusing. It was even more complicated for me because I was trying to use required fields with a DataFrame coming from Polars (where nullable=True by default). I was hitting validation errors because of the nullable differences, so I tried to cast my Polars DataFrame using my predefined Iceberg schema, and it failed because of field_id mismatches:

# Cast the Arrow data to the predefined Iceberg schema's Arrow equivalent
arrow_data = polars_df.to_arrow().cast(my_predefined_iceberg_schema.as_arrow())
iceberg_table.append(arrow_data)

My understanding of the recommended approach is:

  1. If you are not using partitions or sorting, then you can use create_table with a PyArrow schema so you don't have to think about IDs.
import pyarrow as pa

schema = pa.schema([
    pa.field("foo", pa.timestamp("us"), nullable=False),  # pa.timestamp needs a unit
    pa.field("bar", pa.int32(), nullable=True),
])
catalog.create_table(identifier="docs_example.bids", schema=schema)
  2. If you want to use partitions, use the create_table_transaction method instead.
from pyiceberg.transforms import DayTransform

with catalog.create_table_transaction(identifier="docs_example.bids", schema=schema) as txn:
    with txn.update_spec() as update_spec:
        update_spec.add_field("foo", transform=DayTransform())
        update_spec.add_identity("bar")
  3. If you want to use partitions and sorting, wait for the new update_sort_order function to make it into a release so you can use it with create_table_transaction.

Unfortunately, create_table_transaction does not work with certain REST catalog implementations like AWS S3 Tables, because they don't currently support stage-create. So none of the options above work for me.

My current workaround is to use create_table with a pyiceberg.schema.Schema, a PartitionSpec, and a SortOrder, but add a big warning to my code reminding myself not to use that schema for anything else, because its field_id values are not honored. For example, if you need to cast your data to match the table's schema, use catalog.load_table(table_id).schema() to retrieve the REAL schema.
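Concretely, the workaround looks something like this (the identifiers and schema objects are illustrative):

# Create with a hand-written schema; its field_ids will NOT be honored.
catalog.create_table(
    identifier="docs_example.bids",
    schema=my_predefined_iceberg_schema,
    partition_spec=my_partition_spec,
    sort_order=my_sort_order,
)

# Always reload to get the REAL schema (with the catalog-assigned field_ids)
# before casting data against it:
real_schema = catalog.load_table("docs_example.bids").schema()
arrow_data = polars_df.to_arrow().cast(real_schema.as_arrow())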

It would be great if there were a cleaner way to do this.

talawahtech avatar Oct 12 '25 05:10 talawahtech