
TypeError when `operation` field is missing in `summary`.

questsul opened this issue on Aug 27 '24

Apache Iceberg version

0.6.0

Please describe the bug 🐞

When reading a metadata.json file whose snapshot list contains summaries that lack the operation field, PyIceberg fails with the following error:

TypeError: Summary.__init__() missing 1 required positional argument: 'operation'

Interestingly, when parsing the same metadata file using the Iceberg Java library, it works without any issues.

Full stack trace:

 File "reader.py", line 91, in _get_iceberg_table
    return StaticTable.from_metadata(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 1101, in from_metadata
    metadata = FromInputFile.table_metadata(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pyiceberg/serializers.py", line 113, in table_metadata
    return FromByteStream.table_metadata(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pyiceberg/serializers.py", line 94, in table_metadata
    return TableMetadataUtil.parse_raw(metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pyiceberg/table/metadata.py", line 461, in parse_raw
    return TableMetadataWrapper.model_validate_json(data).root
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pydantic/main.py", line 580, in model_validate_json
    return cls.__pydantic_validator__.validate_json(json_data, strict=strict, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Summary.__init__() missing 1 required positional argument: 'operation'
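The failure mode is easy to reproduce in isolation. Below is a minimal, hypothetical stand-in for PyIceberg's Summary model (not the real class, which is a pydantic model) showing why constructing it from a summary dict that has no "operation" key raises exactly this TypeError:

```python
import json

# Hypothetical stand-in for PyIceberg's Summary model: 'operation' is a
# required constructor argument, and any extra summary keys are accepted
# as keyword arguments.
class Summary:
    def __init__(self, operation, **properties):
        self.operation = operation
        self.properties = properties

# A snapshot summary as written by Snowflake: note there is no "operation" key.
snapshot_summary = json.loads(
    '{"added-data-files": "231513", "total-records": "26508666891"}'
)

try:
    Summary(**snapshot_summary)
except TypeError as exc:
    print(exc)  # ... missing 1 required positional argument: 'operation'
```

The Java parser tolerates the missing field (as discussed below), which is why the same file reads fine there.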

Metadata.json example:

{
  "format-version" : 2,
  "table-uuid" : "9996bcdf-3277-48f4-9e76-9e81766c9e0e",
  "location" : "file://t/some_table/",
  "last-sequence-number" : 45,
  "last-updated-ms" : 1724611070351,
  "last-column-id" : 79,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "DATA",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "COLUMN_NAME",
      "required" : false,
      "type" : "string"
    }]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "format-version" : "2"
  },
  "current-snapshot-id" : 1724611070351000000,
  "snapshots" : [ {
    "sequence-number" : 44,
    "snapshot-id" : 1724610129117000000,
    "timestamp-ms" : 1724610129117,
    "manifest-list" : "file://t/some_table/metadata/snap-1724610129117000000-d9b50309-0dff-472d-8711-86ca70021ffb.avro",
    "schema-id" : 0,
    "summary" : {
      "manifests-created" : "8",
      "total-records" : "26508666891",
      "added-files-size" : "3927895626752",
      "manifests-kept" : "0",
      "total-files-size" : "3927895626752",
      "added-records" : "26508666891",
      "added-data-files" : "231513",
      "manifests-replaced" : "0",
      "total-data-files" : "231513"
    }
  }, {
    "sequence-number" : 43,
    "snapshot-id" : 1724006578422000000,
    "timestamp-ms" : 1724006578422,
    "manifest-list" : "file://t/some_table/metadata/snap-1724006578422000000-289566b5-78fe-4b60-9ffa-ab25dee1edde.avro",
    "schema-id" : 0,
    "summary" : {
      "total-files-size" : "3888310341632",
      "added-records" : "26224534820",
      "added-data-files" : "225313",
      "manifests-replaced" : "0",
      "total-data-files" : "225313",
      "manifests-created" : "56",
      "total-records" : "26224534820",
      "added-files-size" : "3888310341632",
      "manifests-kept" : "0"
    }
  }, {
    "sequence-number" : 45,
    "snapshot-id" : 1724611070351000000,
    "timestamp-ms" : 1724611070351,
    "manifest-list" : "file://t/some_table/metadata/snap-1724611070351000000-6a307203-7148-467f-88eb-f932b32dd7f4.avro",
    "schema-id" : 0,
    "summary" : {
      "added-files-size" : "3929709293568",
      "total-records" : "26508666891",
      "manifests-created" : "8",
      "total-data-files" : "227581",
      "manifests-replaced" : "0",
      "added-data-files" : "227581",
      "added-records" : "26508666891",
      "total-files-size" : "3929709293568",
      "operation" : "append",
      "manifests-kept" : "0"
    }
  } ],
  "snapshot-log" : [ {
    "snapshot-id" : 1724006578422000000,
    "timestamp-ms" : 1724006578422
  }, {
    "snapshot-id" : 1724610129117000000,
    "timestamp-ms" : 1724610129117
  }, {
    "snapshot-id" : 1724611070351000000,
    "timestamp-ms" : 1724611070351
  } ]
}

questsul avatar Aug 27 '24 09:08 questsul

How to reproduce: for PyIceberg, I was using a metadata file stored in Azure Blob Storage:

    static_table = StaticTable.from_metadata(
        "abfs://path/metadata/example.metadata.json",
        properties={
            "adlfs.connection-string": "ADD THIS",
        },
    )

Here is the Java snippet I used for verification:

package com.example;

import org.apache.iceberg.StaticTableOperations;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.inmemory.InMemoryFileIO;
import org.apache.iceberg.io.FileIO;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import java.nio.file.Paths;

public class StaticTableExample {
   public static void main(String[] args) throws IOException {

   StaticTableExample ex = new StaticTableExample();

   ex.start();
   }

   public void start() throws IOException {
       InMemoryFileIO fileIO = new InMemoryFileIO();

       String metadataFilePath = "example.metadata.json";

       fileIO.addFile(metadataFilePath, readFileAsBytesNIO(metadataFilePath));

       StaticTableOperations ops = new StaticTableOperations(
               metadataFilePath,
               fileIO
       );

       TableMetadata meta = ops.current();

       System.out.println("Table location: " + meta.location());
   }

   public byte[] readFileAsBytesNIO(String filePath) throws IOException {
       Path path = Path.of(filePath);
       return Files.readAllBytes(path);
   }
}

questsul avatar Aug 27 '24 09:08 questsul

I'm not entirely sure I'm looking at the correct code, but it seems that in Java the operation field is effectively optional during parsing; for example, operation can be initialized to null:

String operation = null;

The metadata.json file I provided was produced by Snowflake. After Snowflake made some updates to their Iceberg implementation, they began creating metadata files in this format. Previously, there were no issues reading Snowflake Iceberg tables using PyIceberg.

questsul avatar Aug 27 '24 10:08 questsul

Hi @questsul - thank you for raising this issue, and for providing this analysis. The optional and required attributes in PyIceberg are based on the nullability of the objects as they are defined in the REST Catalog OpenAPI spec. There, the operation field in summary is labeled as a required attribute:

https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060

I think there are a few takeaways from these findings:

  1. operation is a required field
  2. Java is, interestingly, more lenient when parsing the operation field (and it probably should not be)
  3. The metadata files in your table unfortunately do not conform to the REST API spec. Following up with the folks at Snowflake to see if there's a way to fix these metadata files (for example, by making a copy and adding the missing operation tags) may not be a bad idea
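The copy-and-patch idea in takeaway 3 can be sketched as follows. This is a hypothetical helper, not an official tool, and "append" is only a guess at the real operation for each snapshot; verify the correct value against the table's actual history before relying on the patched file:

```python
import json

def patch_missing_operation(src_path: str, dst_path: str, default_op: str = "append") -> int:
    """Copy a metadata.json, adding 'operation' to snapshot summaries that lack it.

    Returns the number of summaries patched. The default_op value is an
    assumption; the true operation for each snapshot is not recoverable
    from the summary alone.
    """
    with open(src_path) as f:
        meta = json.load(f)

    patched = 0
    for snapshot in meta.get("snapshots", []):
        summary = snapshot.setdefault("summary", {})
        if "operation" not in summary:
            summary["operation"] = default_op
            patched += 1

    with open(dst_path, "w") as f:
        json.dump(meta, f, indent=2)
    return patched
```

The patched copy can then be opened with StaticTable.from_metadata as usual, leaving the original file untouched.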

sungwy avatar Aug 28 '24 12:08 sungwy

Java is interestingly more graceful in parsing the operation tag (and it probably should not be)

@sungwy does this mean that the current Java implementation does not adhere to the spec? If so, we should open a ticket to track it.

kevinjqliu avatar Aug 31 '24 13:08 kevinjqliu

It's currently impossible to use PyIceberg to read Iceberg tables generated by Snowflake because of how they implemented metadata writing, and I find it unlikely that they will fix the implementation soon. :(

I understand that the spec is clear, but considering the Java SDK is the primary interface, it might be worth considering an alternative approach in this specific case, IMO.

In case anybody is blocked because of this issue, I use the following monkey-patching workaround:

# Assumed import paths (PyIceberg 0.6.x/0.7.x layout); adjust if they
# have moved in your version.
from pyiceberg.table.snapshots import Operation, Summary
from pyiceberg.typedef import IcebergBaseModel

# Give the required 'operation' field a default so validation no longer fails.
Summary.model_fields['operation'].default = Operation.APPEND

def summary_init(summary, **kwargs):
    # Bypass Summary.__init__ (which requires 'operation') and delegate
    # straight to the pydantic base class, keeping all summary keys as
    # additional properties.
    super(IcebergBaseModel, summary).__init__(**kwargs)
    summary._additional_properties = kwargs

Summary.__init__ = summary_init

buremba avatar Oct 15 '24 01:10 buremba

Thanks for the context and the workaround!

I think the following steps will help us resolve this issue

  • [x] In PyIceberg, allow reading a summary without the operation field. Even though this deviates from the spec, it's what the current Java implementation supports
  • [x] Raise an issue on the Iceberg devlist about the Java (and Python) implementations' deviation from the spec (edit: https://lists.apache.org/thread/h9qmrmlgxh91ol0y2v8olt90b9q6p9xr)
  • [ ] Raise an issue with Snowflake to write the operation field and align with the iceberg spec

How does this sound?

kevinjqliu avatar Oct 15 '24 19:10 kevinjqliu

That sounds like a good plan @kevinjqliu !

buremba avatar Oct 17 '24 00:10 buremba

Figuring out the proper resolution on the devlist before proceeding with a solution

kevinjqliu avatar Oct 17 '24 16:10 kevinjqliu

resolved by #1263

kevinjqliu avatar Jan 06 '25 18:01 kevinjqliu

Hi, has anyone already raised an issue with Snowflake about this?

I will open one today, but it would be good to know if others have done so as well.

nicornk avatar Sep 02 '25 06:09 nicornk

I stumbled on the issue and opened a support case with Snowflake.

florian-ernst-alan avatar Sep 04 '25 16:09 florian-ernst-alan

👍 I also have one open.

nicornk avatar Sep 04 '25 16:09 nicornk

hey @florian-ernst-alan @nicornk thanks for raising this with Snowflake, that was the last item needed to fully resolve this issue :)

kevinjqliu avatar Sep 04 '25 16:09 kevinjqliu

@kevinjqliu This is the response I received - let's see how long the fix takes. "The engineering team has determined that the issue is caused by a serialization issue around history snapshots, which they are working on fixing."

nicornk avatar Sep 05 '25 06:09 nicornk