TypeError when `operation` field is missing in `summary`.
Apache Iceberg version
0.6.0
Please describe the bug 🐞
When reading a metadata.json file whose snapshot list contains summaries that lack the `operation` field, PyIceberg fails with the following error:
TypeError: Summary.__init__() missing 1 required positional argument: 'operation'
Interestingly, the same metadata file parses without any issues with the Iceberg Java library.
Full stack trace:
File "reader.py", line 91, in _get_iceberg_table
return StaticTable.from_metadata(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 1101, in from_metadata
metadata = FromInputFile.table_metadata(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/pyiceberg/serializers.py", line 113, in table_metadata
return FromByteStream.table_metadata(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/pyiceberg/serializers.py", line 94, in table_metadata
return TableMetadataUtil.parse_raw(metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/pyiceberg/table/metadata.py", line 461, in parse_raw
return TableMetadataWrapper.model_validate_json(data).root
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/pydantic/main.py", line 580, in model_validate_json
return cls.__pydantic_validator__.validate_json(json_data, strict=strict, context=context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Summary.__init__() missing 1 required positional argument: 'operation'
Metadata.json example:
{
"format-version" : 2,
"table-uuid" : "9996bcdf-3277-48f4-9e76-9e81766c9e0e",
"location" : "file://t/some_table/",
"last-sequence-number" : 45,
"last-updated-ms" : 1724611070351,
"last-column-id" : 79,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "DATA",
"required" : false,
"type" : "string"
}, {
"id" : 2,
"name" : "COLUMN_NAME",
"required" : false,
"type" : "string"
}]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ ]
} ],
"last-partition-id" : 999,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"format-version" : "2"
},
"current-snapshot-id" : 1724611070351000000,
"snapshots" : [ {
"sequence-number" : 44,
"snapshot-id" : 1724610129117000000,
"timestamp-ms" : 1724610129117,
"manifest-list" : "file://t/some_table/metadata/snap-1724610129117000000-d9b50309-0dff-472d-8711-86ca70021ffb.avro",
"schema-id" : 0,
"summary" : {
"manifests-created" : "8",
"total-records" : "26508666891",
"added-files-size" : "3927895626752",
"manifests-kept" : "0",
"total-files-size" : "3927895626752",
"added-records" : "26508666891",
"added-data-files" : "231513",
"manifests-replaced" : "0",
"total-data-files" : "231513"
}
}, {
"sequence-number" : 43,
"snapshot-id" : 1724006578422000000,
"timestamp-ms" : 1724006578422,
"manifest-list" : "file://t/some_table/metadata/snap-1724006578422000000-289566b5-78fe-4b60-9ffa-ab25dee1edde.avro",
"schema-id" : 0,
"summary" : {
"total-files-size" : "3888310341632",
"added-records" : "26224534820",
"added-data-files" : "225313",
"manifests-replaced" : "0",
"total-data-files" : "225313",
"manifests-created" : "56",
"total-records" : "26224534820",
"added-files-size" : "3888310341632",
"manifests-kept" : "0"
}
}, {
"sequence-number" : 45,
"snapshot-id" : 1724611070351000000,
"timestamp-ms" : 1724611070351,
"manifest-list" : "file://t/some_table/metadata/snap-1724611070351000000-6a307203-7148-467f-88eb-f932b32dd7f4.avro",
"schema-id" : 0,
"summary" : {
"added-files-size" : "3929709293568",
"total-records" : "26508666891",
"manifests-created" : "8",
"total-data-files" : "227581",
"manifests-replaced" : "0",
"added-data-files" : "227581",
"added-records" : "26508666891",
"total-files-size" : "3929709293568",
"operation" : "append",
"manifests-kept" : "0"
}
} ],
"snapshot-log" : [ {
"snapshot-id" : 1724006578422000000,
"timestamp-ms" : 1724006578422
}, {
"snapshot-id" : 1724610129117000000,
"timestamp-ms" : 1724610129117
}, {
"snapshot-id" : 1724611070351000000,
"timestamp-ms" : 1724611070351
} ]
}
How to reproduce: For PyIceberg, I was using a metadata file stored in Azure Blob Storage:
from pyiceberg.table import StaticTable

static_table = StaticTable.from_metadata(
    "abfs://path/metadata/example.metadata.json",
    properties={
        "adlfs.connection-string": "ADD THIS",
    },
)
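The same failure can also be reproduced without Azure by parsing the JSON directly; a minimal sketch, assuming the metadata shown above is saved locally as example.metadata.json (TableMetadataUtil.parse_raw is the same call that appears in the stack trace):

# Minimal local reproduction sketch: no object store required.
from pyiceberg.table.metadata import TableMetadataUtil

with open("example.metadata.json") as f:
    # On PyIceberg 0.6.0 this raises:
    # TypeError: Summary.__init__() missing 1 required positional argument: 'operation'
    TableMetadataUtil.parse_raw(f.read())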
Here is the Java snippet I used for verification:
package com.example;
import org.apache.iceberg.StaticTableOperations;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.inmemory.InMemoryFileIO;
import org.apache.iceberg.io.FileIO;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
public class StaticTableExample {
public static void main(String[] args) throws IOException {
StaticTableExample ex = new StaticTableExample();
ex.start();
}
public void start() throws IOException {
InMemoryFileIO fileIO = new InMemoryFileIO();
String metadataFilePath = "example.metadata.json";
fileIO.addFile(metadataFilePath, readFileAsBytesNIO(metadataFilePath));
StaticTableOperations ops = new StaticTableOperations(
metadataFilePath,
fileIO
);
TableMetadata meta = ops.current();
System.out.println("Table location: " + meta.location());
}
public byte[] readFileAsBytesNIO(String filePath) throws IOException {
Path path = Path.of(filePath);
return Files.readAllBytes(path);
}
}
I'm not entirely sure if I'm looking at the correct code, but it seems that in Java the `operation` field might be optional during parsing; for example, it appears that `operation` can be set to null.
The metadata.json file I provided was produced by Snowflake. After Snowflake made some updates to their Iceberg implementation, they began creating metadata files in this format. Previously, there were no issues reading Snowflake Iceberg tables using PyIceberg.
Hi @questsul - thank you for raising this issue, and for providing this analysis. The optional and required attributes in PyIceberg are based on the nullability of the objects as they are defined in the REST Catalog OpenAPI spec. Here, the `operation` field in `summary` is labeled as a required attribute:
https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060
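For illustration only, here is a minimal sketch of how a required versus an optional field behaves during pydantic validation. These are toy models, not PyIceberg's actual Summary definition; in PyIceberg the failure surfaces as a TypeError rather than a ValidationError because Summary defines a custom __init__ with a positional operation argument.

from typing import Optional
from pydantic import BaseModel, ValidationError

# Toy models for illustration only; not PyIceberg's Summary class.
class RequiredSummary(BaseModel):
    operation: str                   # required: no default value

class OptionalSummary(BaseModel):
    operation: Optional[str] = None  # optional: missing input is tolerated

try:
    RequiredSummary.model_validate({"added-data-files": "231513"})
except ValidationError as err:
    print(err)                       # reports that 'operation' is missing

print(OptionalSummary.model_validate({"added-data-files": "231513"}))  # operation=None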
I think there are a few takeaways here based on our findings:
- operation is a required field
- Java is interestingly more graceful in parsing the operation tag (and it probably should not be)
- The metadata files in your table are unfortunately not conformant to the REST API spec. Following up with the folks at Snowflake to see whether there's a way to fix these metadata files (for example, by making a copy and adding the missing `operation` tags, as sketched below) may not be a bad idea
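If patching the files is an option, a minimal sketch of that idea, assuming the metadata file is available locally; the inserted value "append" is a guess and may not reflect what each snapshot actually did:

import json

# Hedged stop-gap sketch: copy the metadata file and insert a placeholder
# operation into any snapshot summary that lacks one. The chosen value
# ("append") is an assumption, not derived from the snapshot's real history.
with open("example.metadata.json") as src:
    metadata = json.load(src)

for snapshot in metadata.get("snapshots", []):
    snapshot.setdefault("summary", {}).setdefault("operation", "append")

with open("example.fixed.metadata.json", "w") as dst:
    json.dump(metadata, dst, indent=2)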
> Java is interestingly more graceful in parsing the operation tag (and it probably should not be)
@sungwy does this mean that the current Java implementation does not adhere to the spec? If so, we should open a ticket to track it.
It's currently impossible to use PyIceberg to read data from Iceberg tables generated via Snowflake because of how they implemented it, and I find it unlikely that they will fix the implementation soon. :(
I understand that the spec is clear, but considering that the Java SDK is the primary interface, it might be worth considering an alternative approach in this specific case, IMO.
In case anybody is blocked because of this issue, I use the following monkey-patching workaround:
from pyiceberg.table.snapshots import Operation, Summary  # import paths per pyiceberg 0.6.x
from pyiceberg.typedef import IcebergBaseModel

# Default the missing field and bypass Summary.__init__'s required positional 'operation'.
Summary.model_fields['operation'].default = Operation.APPEND

def summary_init(summary, **kwargs):
    super(IcebergBaseModel, summary).__init__(**kwargs)
    summary._additional_properties = kwargs
Summary.__init__ = summary_init
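For completeness, a hedged end-to-end sketch of using the workaround: apply the patch before any metadata is read, then load the table as in the reproduction above. The abfs:// path and connection string are placeholders, and patch_summary is just the snippet above wrapped in a hypothetical helper.

from pyiceberg.table import StaticTable
from pyiceberg.table.snapshots import Operation, Summary
from pyiceberg.typedef import IcebergBaseModel

def patch_summary() -> None:
    # Hypothetical helper wrapping the monkey-patch from the comment above.
    Summary.model_fields['operation'].default = Operation.APPEND

    def summary_init(summary, **kwargs):
        super(IcebergBaseModel, summary).__init__(**kwargs)
        summary._additional_properties = kwargs

    Summary.__init__ = summary_init

patch_summary()  # must run before StaticTable.from_metadata parses the snapshots
table = StaticTable.from_metadata(
    "abfs://path/metadata/example.metadata.json",
    properties={"adlfs.connection-string": "ADD THIS"},
)
print(table.metadata.current_snapshot_id)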
Thanks for the context and the workaround!
I think the following steps will help us resolve this issue
- [x] In PyIceberg, allow reading `summary` without the `operation` field. Even though this deviates from the spec, it's what the current Java implementation supports
- [x] Raise an issue on the Iceberg devlist about the deviation of the Java (and Python) implementations from the spec (edit: https://lists.apache.org/thread/h9qmrmlgxh91ol0y2v8olt90b9q6p9xr)
- [ ] Raise an issue with Snowflake to write the `operation` field and align with the Iceberg spec
How does this sound?
That sounds like a good plan @kevinjqliu !
Figuring out the proper resolution on devlist before proceeding with a solution
resolved by #1263
Hi, has anyone already raised an issue with Snowflake about this?
I will open one today, but it would be good to know whether others have done so as well.
I stumbled on the issue and opened a support case with Snowflake.
👍 I also have one open.
hey @florian-ernst-alan @nicornk thanks for raising this with snowflake, that was the last item to fully resolve this issue :)
@kevinjqliu This is the response I received - let's see how long the fix takes. "The engineering team has determined that the issue is caused by a serialization issue around history snapshots, which they are working on fixing."