iceberg-python
Schema Evolution with `StructType` via `update_schema()` Fails
Apache Iceberg version
0.10.0 (latest release)
Please describe the bug 🐞
⸻
Environment / Setup Details
- PyIceberg version: latest
- Catalog: AWS Glue
- Dependencies: includes pyarrow
- Python version: 3.12.10 (per the interpreter paths in the traceback below)
⸻
The Problem
Given an existing table test with schema:
Schema(
    NestedField(1, "id", StringType(), required=True),
    NestedField(2, "name", StringType(), required=False),
    NestedField(3, "roll_number", IntegerType(), required=True),
)
I attempt to evolve the schema after table creation by adding a new column address of type StructType:
StructType(
    NestedField(4, "street", StringType(), required=False),
    NestedField(5, "city", StringType(), required=False),
    NestedField(6, "state", StringType(), required=False),
    NestedField(7, "zip", IntegerType(), required=False),
)
Using the update_schema() context manager and its add_column(...) method to add this StructType field results in a BadRequestError:
pyiceberg.exceptions.BadRequestError: InvalidInputException: Cannot parse to an integer value: id: 5.0
What should happen:
- The new `StructType` field should be added without errors.
- You should be able to evolve a schema to include nested/struct types via `update_schema()` just as you can at table creation.
- I remember this working up till last Thursday (18th September 2025).
What is actually happening:
- Adding a `StructType` via `update_schema()` throws `InvalidInputException: Cannot parse to an integer value: id: 5.0`.
- The error indicates something is trying to parse “5.0” (a float) as an integer, presumably where a field-id or column ID is expected to be an integer.
Full traceback
Traceback (most recent call last):
File "/Users/mukul/Documents/extra/iceberg/iceberg.py", line 318, in <module>
with table.update_schema() as updater:
^^^^^^^^^^^^^^^^^^^^^
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/pyiceberg/table/update/__init__.py", line 76, in __exit__
self.commit()
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/pyiceberg/table/update/__init__.py", line 72, in commit
self._transaction._apply(*self._commit())
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 295, in _apply
self.commit_transaction()
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 936, in commit_transaction
self._table._do_commit( # pylint: disable=W0212
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 1458, in _do_commit
response = self.catalog.commit_table(self, requirements, updates)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/tenacity/__init__.py", line 338, in wrapped_f
return copy(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/tenacity/__init__.py", line 477, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/tenacity/__init__.py", line 378, in iter
result = action(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/tenacity/__init__.py", line 400, in <lambda>
self._add_action_func(lambda rs: rs.outcome.result())
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.12.10_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.12.10_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/tenacity/__init__.py", line 480, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/pyiceberg/catalog/rest/__init__.py", line 722, in commit_table
_handle_non_200_response(
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/pyiceberg/catalog/rest/response.py", line 111, in _handle_non_200_response
raise exception(response) from exc
pyiceberg.exceptions.BadRequestError: InvalidInputException: Cannot parse to an integer value: id: 5.0
⸻
Steps to Reproduce
- Create a table with the original schema:
schema = Schema(
    NestedField(1, "id", StringType(), required=True),
    NestedField(2, "name", StringType(), required=False),
    NestedField(3, "roll_number", IntegerType(), required=True),
)
table = catalog.create_table(
    identifier=table_id,
    schema=schema,
)
- Load the table and attempt schema evolution:
table = catalog.load_table(table_id)
with table.update_schema() as updater:
    updater.add_column(
        path="address",
        field_type=StructType(
            NestedField(4, "street", StringType(), required=False),
            NestedField(5, "city", StringType(), required=False),
            NestedField(6, "state", StringType(), required=False),
            NestedField(7, "zip", IntegerType(), required=False),
        ),
        required=False,
    )
- Observe the error above.
⸻
Additional Observations
- The error only occurs when using `update_schema()` / schema evolution after the table has been created.
- Creating the table with the `StructType` already included does not cause this error.
- Also, if the `StructType` field already exists (from creation) and then you try to add a new integer column (simple type) using `update_schema()`, you encounter a similar error.
⸻
Suggested Investigation / Possible Cause
- The error message Cannot parse to an integer value: id: 5.0 suggests something is wrongly computing a field or column ID as a float (5.0 instead of integer 5).
- Perhaps the incremental assignment of new field IDs in schema evolution is mishandled when adding nested/struct types.
- Possible bug in the serialization or metadata packaging step, or in how nested field IDs are validated / sent to the catalog (Glue/REST) interface.
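As a rough illustration of the symptom only (this is not Glue's or PyIceberg's code, and the helper name `parse_field_id` is hypothetical), the sketch below shows how an ID that has been turned into a float somewhere along the way no longer parses as an integer once it is rendered as text:

```python
import json


def parse_field_id(raw: str) -> int:
    """Strictly parse a field ID from text (hypothetical helper, for illustration)."""
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"Cannot parse to an integer value: id: {raw}") from exc


# A field as a client would send it: the ID is a JSON integer.
wire = '{"id": 5, "name": "city", "type": "string", "required": false}'

# Assumption made purely to mimic the symptom: every number is read as a float.
as_float = json.loads(wire, parse_int=float)

print(parse_field_id(str(json.loads(wire)["id"])))  # 5
try:
    parse_field_id(str(as_float["id"]))
except ValueError as err:
    print(err)  # Cannot parse to an integer value: id: 5.0
```

Where such a conversion actually happens (client, transport, or catalog) is exactly what the investigation needs to establish.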
⸻
Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
Hey @mukul-mpac, thanks for reporting this issue. I am not able to reproduce it.
Here's what I tried:
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import IntegerType, NestedField, StringType, StructType

catalog = load_catalog("default", type="in-memory")
catalog.create_namespace_if_not_exists("default")
schema = Schema(
    NestedField(1, "id", StringType(), required=True),
    NestedField(2, "name", StringType(), required=False),
    NestedField(3, "roll_number", IntegerType(), required=True),
)
table = catalog.create_table(
    identifier="default.test",
    schema=schema,
)
with table.update_schema() as updater:
    updater.add_column(
        path="address",
        field_type=StructType(
            NestedField(4, "street", StringType(), required=False),
            NestedField(5, "city", StringType(), required=False),
            NestedField(6, "state", StringType(), required=False),
            NestedField(7, "zip", IntegerType(), required=False),
        ),
        required=False,
    )
table.schema()
>>> table.schema()
Schema(NestedField(field_id=1, name='id', field_type=StringType(), required=True), NestedField(field_id=2, name='name', field_type=StringType(), required=False), NestedField(field_id=3, name='roll_number', field_type=IntegerType(), required=True), NestedField(field_id=4, name='address', field_type=StructType(fields=(NestedField(field_id=5, name='street', field_type=StringType(), required=False), NestedField(field_id=6, name='city', field_type=StringType(), required=False), NestedField(field_id=7, name='state', field_type=StringType(), required=False), NestedField(field_id=8, name='zip', field_type=IntegerType(), required=False),)), required=False), schema_id=1, identifier_field_ids=[])
File "/Users/mukul/Documents/extra/iceberg/venv/lib/python3.12/site-packages/pyiceberg/catalog/rest/response.py", line 111, in _handle_non_200_response
raise exception(response) from exc
pyiceberg.exceptions.BadRequestError: InvalidInputException: Cannot parse to an integer value: id: 5.0
the error message is coming from the catalog response, which is aws glue in this case
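One detail visible in the schema printed above: the IDs passed to `NestedField` inside `add_column` (4 to 7) are not kept. The new `address` column receives ID 4 and its children receive 5 to 8. The sketch below reproduces that numbering with a hypothetical helper, `assign_fresh_ids`; it is only consistent with the output shown here and is not PyIceberg's actual implementation.

```python
from itertools import count


def assign_fresh_ids(last_column_id, child_names):
    """Give the new parent column the next free ID, then number its children in order."""
    next_id = count(last_column_id + 1)
    parent_id = next(next_id)
    children = {name: next(next_id) for name in child_names}
    return parent_id, children


# The original table ends at field ID 3 (roll_number).
parent_id, children = assign_fresh_ids(3, ["street", "city", "state", "zip"])
print(parent_id)  # 4
print(children)   # {'street': 5, 'city': 6, 'state': 7, 'zip': 8}
```

Under this numbering, ID 5 is the first nested ID in the new schema, which matches the `id: 5.0` in the reported error.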
@kevinjqliu Thank you for the response,
I have started a discussion in aws forums based on your intel (here).
Are there any other steps you would recommend to solve this issue? I expect this to be a major blocker, since a lot of developers must be using the AWS Glue REST API catalog.
Hey there, I think the API should be invoked as:
with table.update_schema() as updater:
    updater.add_column(("address", "street"), StringType(), required=False)
    ...
However, after a quick test, this still causes an issue. In this case, we should inject a StructType. Thoughts?
@Fokko
with table.update_schema() as updater:
    updater.add_column(("address", "street"), StringType(), required=False)
    ...
The above works if address is already present as a StructType. It does not solve my current issue, but thank you for your reply.
Ran into this and after doing some testing I believe the problem is on the AWS side. Just commenting here because it may have more visibility than on the AWS forum. I am currently working with AWS support on the matter.
This is pretty easily reproducible; just try to update schema with a non-primitive new column (e.g. list or struct). For example:
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# account, bucket, and region are placeholders to be set for your own AWS setup
catalog = load_catalog(
    'default',
    type='rest',
    warehouse=f'{account}:s3tablescatalog/{bucket}',
    uri=f'https://glue.{region}.amazonaws.com/iceberg',
    **{
        'rest.sigv4-enabled': 'true',
        'rest.signing-name': 'glue',
        'rest.signing-region': region,
    }
)
catalog.create_namespace('scott_test')

# create table: OK
initial_schema = pa.schema([
    pa.field("a", pa.string(), nullable=True),
    pa.field("b", pa.string(), nullable=True),
])
catalog.create_table('scott_test.element_id_bug', initial_schema)

# update table w/ primitive type: OK
table = catalog.load_table('scott_test.element_id_bug')
update_schema = pa.schema([
    pa.field("c", pa.string(), nullable=True),
])
with table.update_schema() as update:
    update.union_by_name(update_schema)

# update table with list type: FAIL
table = catalog.load_table('scott_test.element_id_bug')
update_schema = pa.schema([
    pa.field("d", pa.list_(pa.string()), nullable=True),
])
with table.update_schema() as update:
    update.union_by_name(update_schema)
This last operation throws an exception:
BadRequestError: InvalidInputException: Cannot parse to an integer value: element-id: 5.0
I captured the rest payload and confirmed that element id 5 is being sent as an integer:
{
  "identifier": {
    "namespace": [
      "scott_test"
    ],
    "name": "element_id_bug"
  },
  "requirements": [
    {
      "type": "assert-current-schema-id",
      "current-schema-id": 1
    },
    {
      "type": "assert-table-uuid",
      "uuid": "REDACTED"
    }
  ],
  "updates": [
    {
      "action": "add-schema",
      "schema": {
        "type": "struct",
        "fields": [
          {
            "id": 1,
            "name": "a",
            "type": "string",
            "required": false
          },
          {
            "id": 2,
            "name": "b",
            "type": "string",
            "required": false
          },
          {
            "id": 3,
            "name": "c",
            "type": "string",
            "required": false
          },
          {
            "id": 4,
            "name": "d",
            "type": {
              "type": "list",
              "element-id": 5,
              "element": "string",
              "element-required": false
            },
            "required": false
          }
        ],
        "schema-id": 2,
        "identifier-field-ids": []
      },
      "last-column-id": 5
    },
    {
      "action": "set-current-schema",
      "schema-id": -1
    }
  ]
}
So this pretty clearly seems like an AWS issue.
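For anyone who wants to repeat that check on their own captured request body, here is a minimal stdlib-only sketch. The function name `non_integer_ids` and the set of keys it inspects are my own choices (the keys follow the field names seen in the payload above plus the map `key-id` / `value-id` names from the Iceberg schema JSON format); how you capture the payload is up to you.

```python
import json

# Keys expected to hold integer IDs in an Iceberg schema / commit payload.
ID_KEYS = {"id", "element-id", "key-id", "value-id", "last-column-id", "schema-id"}


def non_integer_ids(node, path="$"):
    """Return (path, value) pairs for ID-like keys whose value is not a plain int."""
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            # bool is a subclass of int in Python, so reject it explicitly.
            if key in ID_KEYS and (isinstance(value, bool) or not isinstance(value, int)):
                problems.append((child, value))
            problems.extend(non_integer_ids(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(non_integer_ids(item, f"{path}[{index}]"))
    return problems


# The list column from the payload above, trimmed to the relevant part.
captured = json.loads(
    '{"fields": [{"id": 4, "name": "d", "type": {"type": "list", "element-id": 5,'
    ' "element": "string", "element-required": false}, "required": false}],'
    ' "last-column-id": 5}'
)
print(non_integer_ids(captured))  # []
```

An empty list means every ID left the client as a JSON integer, which is what the capture above shows.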
I just tried another combination and this also triggers it:
# create table with list type: OK
initial_schema = pa.schema([
    pa.field("a", pa.string(), nullable=True),
    pa.field("b", pa.list_(pa.string()), nullable=True),
])
catalog.create_table('scott_test.element_id_bug', initial_schema)

# update table with primitive: FAIL
table = catalog.load_table('scott_test.element_id_bug')
update_schema = pa.schema([
    pa.field("c", pa.string(), nullable=True),
])
with table.update_schema() as update:
    update.union_by_name(update_schema)
... which somewhat makes sense because the REST payload above seems to contain all fields in the commit_table() request. But I can't imagine how it's anything but an AWS problem.
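Since the whole schema is resent on each commit, one way to predict whether an update is affected is to check whether the resulting schema contains any nested type at all. Below is a small sketch of that check on the schema JSON; `nested_fields` is a hypothetical helper, and the field IDs in the sample are illustrative rather than captured from a real request.

```python
def nested_fields(schema: dict) -> list[str]:
    """Names of top-level fields whose JSON type is an object (struct, list, or map)."""
    return [
        field["name"]
        for field in schema.get("fields", [])
        if isinstance(field.get("type"), dict)
    ]


# Second scenario: the list column exists from creation, then a string column is added.
schema_after_update = {
    "type": "struct",
    "fields": [
        {"id": 1, "name": "a", "type": "string", "required": False},
        {
            "id": 2,
            "name": "b",
            "type": {"type": "list", "element-id": 3, "element": "string", "element-required": False},
            "required": False,
        },
        {"id": 4, "name": "c", "type": "string", "required": False},
    ],
}

print(nested_fields(schema_after_update))  # ['b']
```

A non-empty result matches both failing cases reported in this thread: adding a nested column, and adding a primitive column to a table that already has one.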
Update:
We're currently in the process of getting this fixed with AWS. The support team has successfully reproduced the error and passed it on to the Glue service team.
I have an update from the AWS team - they have deployed a fix in the eu-west-1 region, and we’ve confirmed that it works correctly now.
thanks for the update @jakuborlowski
Another update from the AWS team:
The fix should now also be in the us-east-1 region, along with eu-west-1. It will be rolled out slowly to the other regions!
Update:
Great news, AWS has rolled out the fix.
I have tested the broken functionality as originally described in the issue, and it works well now.
For reference I am in region ca-central-1