Fields with mixed datatypes
Question
How would I go about using a field with mixed datatypes? Is that recommended/possible? I am a fan of tall, tidy data and am wondering how to properly go about the following:
from pydantic import BaseModel
from datetime import datetime
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

class Message(BaseModel):
    system: str
    node: str
    message_name: str
    signal: str
    bus: str
    timestamp: datetime
    value: int | float | bool | str

    @staticmethod
    def to_pyarrow_schema():
        return pa.schema([
            pa.field('system', pa.string()),
            pa.field('node', pa.string()),
            pa.field('message_name', pa.string()),
            pa.field('signal', pa.string()),
            pa.field('bus', pa.string()),
            pa.field('timestamp', pa.timestamp('s', tz='UTC')),
            # Attempted mixed-type column: an Arrow sparse union of int/float/bool/string
            pa.field('value', pa.union([
                pa.field('value', pa.int32()),
                pa.field('value', pa.float64()),
                pa.field('value', pa.bool_()),
                pa.field('value', pa.string()),
            ], mode=pa.lib.UnionMode_SPARSE)),
        ])
catalog = SqlCatalog(
    "default",
    **{
        "uri": "my_uri/catalog",
    },
)

catalog.create_table(
    identifier="default.messages",
    schema=Message.to_pyarrow_schema(),
)
Right now it throws an error, TypeError: Expected primitive type, got: <class 'pyarrow.lib.SparseUnionType'>, which makes sense, as what I am attempting isn't supported.
Should I be using a string type and casting in my queries?
I think the columns are generally strongly typed and won't allow a union type: https://py.iceberg.apache.org/reference/pyiceberg/types/
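For illustration, here's a rough sketch of how those columns look when declared directly with PyIceberg's typed schema classes, where each field must carry exactly one concrete type; the field IDs, required flags, and the string type for value are choices I'm making for the example, not something from your snippet:

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestamptzType

# Each NestedField takes exactly one primitive (or nested) type -- there is no union type.
iceberg_schema = Schema(
    NestedField(field_id=1, name="system", field_type=StringType(), required=False),
    NestedField(field_id=2, name="node", field_type=StringType(), required=False),
    NestedField(field_id=3, name="message_name", field_type=StringType(), required=False),
    NestedField(field_id=4, name="signal", field_type=StringType(), required=False),
    NestedField(field_id=5, name="bus", field_type=StringType(), required=False),
    NestedField(field_id=6, name="timestamp", field_type=TimestamptzType(), required=False),
    NestedField(field_id=7, name="value", field_type=StringType(), required=False),
)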
Here's the spec's description of data types; a data type must be either a primitive or a nested type: https://iceberg.apache.org/spec/#schemas-and-data-types
It looks like, among the primitive types, there are string and binary types that can serve as the on-disk representation. You would need to cast in the query layer, as you described above.
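To make that concrete, here's a minimal sketch of the string-typed approach, reusing the catalog object from your snippet; the value_type helper column, the table name, and the float filter are assumptions for illustration, not anything required by PyIceberg:

import pyarrow as pa
import pyarrow.compute as pc

# Every value lands on disk as a string; value_type records what to cast it back to.
message_schema = pa.schema([
    pa.field('system', pa.string()),
    pa.field('node', pa.string()),
    pa.field('message_name', pa.string()),
    pa.field('signal', pa.string()),
    pa.field('bus', pa.string()),
    pa.field('timestamp', pa.timestamp('s', tz='UTC')),
    pa.field('value', pa.string()),       # on-disk representation
    pa.field('value_type', pa.string()),  # e.g. 'int', 'float', 'bool', 'str'
])

table = catalog.create_table(
    identifier="default.messages",
    schema=message_schema,
)

# Query layer: filter down to one logical type, then cast on the fly.
floats = table.scan(row_filter="value_type = 'float'").to_arrow()
float_values = pc.cast(floats['value'], pa.float64())

The main trade-off is that everything round-trips through string parsing; a binary column with an explicit encoding would avoid precision surprises at the cost of human readability.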
Thank you!