Writing to a table with an optional map field fails if the data is missing the map field
Question
I am not sure whether this is a bug or expected behavior, but when writing to a table with an optional map field, the write fails if the input data is missing that field entirely. This is because even though the map is optional, the map key is required. This does not happen with other types.
Here's a script to reproduce:
from pyiceberg.catalog import load_catalog
import pyarrow as pa

warehouse_path = "/tmp/warehouse"
catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
catalog.create_namespace_if_not_exists("test")

schema = pa.schema({
    "id": pa.int64(),
    "text": pa.string(),
    "map": pa.map_(pa.string(), pa.string()),
})
table = catalog.create_table_if_not_exists("test.table", schema)

# Only "id" is supplied; "text" and "map" are absent from the input.
table.append(pa.Table.from_pylist([{"id": 1}]))
This will throw:
│ ✅ │ 1: id: optional long │ 1: id: optional long │
│ ✅ │ 2: text: optional string │ Missing │
│ ✅ │ 3: map: optional map<string, string> │ Missing │
│ ❌ │ 4: key: required string │ Missing │
│ ✅ │ 5: value: optional string │ Missing │
The solution I found is to cast the input data to the table schema when writing, but it's not always practical.
Good catch, and thanks for the repro. The Map key is always required: https://github.com/apache/iceberg-python/blob/5773b7f1bf2081a90a490f9d670eef804eb88ab4/pyiceberg/types.py#L582
https://github.com/apache/iceberg-python/blob/5773b7f1bf2081a90a490f9d670eef804eb88ab4/pyiceberg/schema.py#L1803
_is_field_compatible will always enforce the Map key field as required, even when the map itself is optional and missing from the input.
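To make the mechanism concrete, here is a deliberately simplified model of a per-field compatibility check. It is not PyIceberg's actual _is_field_compatible code; Field and check_fields are hypothetical names, and the rule (a field passes if it is provided or optional) is an assumption chosen to reproduce the report above. Because the nested key is checked on its own, it fails even though its optional parent map passes.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for a schema field; not a PyIceberg class."""
    name: str
    required: bool
    children: List["Field"] = field(default_factory=list)


def check_fields(fields: List[Field], provided: Set[str]) -> List[Tuple[str, bool]]:
    """Check each field independently: it passes if provided or optional.

    Nested fields are visited without looking at whether their parent is
    optional and missing, which is the behavior being illustrated.
    """
    results = []
    for f in fields:
        ok = f.name in provided or not f.required
        results.append((f.name, ok))
        results.extend(check_fields(f.children, provided))
    return results


table_fields = [
    Field("id", required=False),
    Field("text", required=False),
    Field("map", required=False, children=[
        Field("key", required=True),     # map keys are always required
        Field("value", required=False),
    ]),
]

# The input only provides "id", as in the repro.
report = check_fields(table_fields, provided={"id"})
for name, ok in report:
    print("ok  " if ok else "FAIL", name)
```

In this model only key fails, which matches the single ❌ row in the report.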
Hm, I need to dig into this a little deeper. Like you mentioned, casting the pa.Table with the provided schema works:
from pyiceberg.catalog import load_catalog
import pyarrow as pa

warehouse_path = "/tmp/warehouse"
catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
catalog.create_namespace_if_not_exists("test")

schema = pa.schema({
    "id": pa.int64(),
    "text": pa.string(),
    "map": pa.map_(pa.string(), pa.string()),
})
table = catalog.create_table_if_not_exists("test.table", schema)
print("table schema:", table.schema())
print()

# Passing the schema explicitly fills the missing "text" and "map" columns with nulls.
data = pa.Table.from_pylist([{"id": 1}], schema=schema)
print("data schema:", data.schema)
print()

table.append(data)