
Writing to a table with an optional map field fails if data is missing map field

Open dbnl-renaud opened this issue 2 months ago • 2 comments

Question

I am not sure if this is a bug or expected behavior, but when writing to a table with an optional map field, the write fails if the input data is missing that field entirely. This happens because, even though the map itself is optional, the map's key field is required. Other types do not have this problem.

Here's a script to reproduce:

from pyiceberg.catalog import load_catalog
import pyarrow as pa

warehouse_path = "/tmp/warehouse"
catalog = load_catalog(
    "default",
    **{
        'type': 'sql',
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

catalog.create_namespace_if_not_exists("test")

schema = pa.schema({
    "id": pa.int64(),
    "text": pa.string(),
    "map": pa.map_(pa.string(), pa.string())
})

table = catalog.create_table_if_not_exists("test.table", schema)

table.append(pa.Table.from_pylist([{"id": 1}]))

This will throw:

│ ✅ │ 1: id: optional long                 │ 1: id: optional long │
│ ✅ │ 2: text: optional string             │ Missing              │
│ ✅ │ 3: map: optional map<string, string> │ Missing              │
│ ❌ │ 4: key: required string              │ Missing              │
│ ✅ │ 5: value: optional string            │ Missing              │

The workaround I found is to cast the input data to the table schema when writing, but that is not always practical.

dbnl-renaud avatar Nov 03 '25 20:11 dbnl-renaud

Good catch, and thanks for the repro. The Map key is always required: https://github.com/apache/iceberg-python/blob/5773b7f1bf2081a90a490f9d670eef804eb88ab4/pyiceberg/types.py#L582

https://github.com/apache/iceberg-python/blob/5773b7f1bf2081a90a490f9d670eef804eb88ab4/pyiceberg/schema.py#L1803

`_is_field_compatible` treats the map key field as always required, so the compatibility check fails whenever the key field is missing from the input.

kevinjqliu avatar Nov 03 '25 20:11 kevinjqliu

Hm, I'll have to dig into this a little deeper. Like you mentioned, casting the `pa.Table` with the provided schema works:

from pyiceberg.catalog import load_catalog
import pyarrow as pa

warehouse_path = "/tmp/warehouse"
catalog = load_catalog(
    "default",
    **{
        'type': 'sql',
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

catalog.create_namespace_if_not_exists("test")

schema = pa.schema({
    "id": pa.int64(),
    "text": pa.string(),
    "map": pa.map_(pa.string(), pa.string())
})

table = catalog.create_table_if_not_exists("test.table", schema)
print("table schema:", table.schema())
print()

data = pa.Table.from_pylist([{"id": 1}], schema=schema)
print("data schema:", data.schema)
print()
table.append(data)

kevinjqliu avatar Nov 03 '25 21:11 kevinjqliu