Fields with mixed datatypes

Open jayceslesar opened this issue 1 year ago • 1 comments

Question

How would I go about using a field with mixed datatypes? Is that recommended, or even possible? I am a fan of tall-tidy data and am wondering how to properly go about the following.

from pydantic import BaseModel
from datetime import datetime
import pyarrow as pa

from pyiceberg.catalog.sql import SqlCatalog


class Message(BaseModel):
    system: str
    node: str
    message_name: str
    signal: str
    bus: str
    timestamp: datetime
    value: int | float | bool | str

    @staticmethod
    def to_pyarrow_schema():
        return pa.schema([
            pa.field('system', pa.string()),
            pa.field('node', pa.string()),
            pa.field('message_name', pa.string()),
            pa.field('signal', pa.string()),
            pa.field('bus', pa.string()),
            pa.field('timestamp', pa.timestamp('s', tz='UTC')),
            pa.field(
                'value',  # pa.field() requires a name; the union is the field's type
                pa.union(
                    [pa.field("value", pa.int32()), pa.field("value", pa.float64()), pa.field("value", pa.bool_()), pa.field("value", pa.string())],
                    mode=pa.lib.UnionMode_SPARSE,
                ),
            ),
        ])

catalog = SqlCatalog(
    "default",
    **{
        "uri": "my_uri/catalog",
    },
)

catalog.create_table(
    identifier="default.messages",
    schema=Message.to_pyarrow_schema(),
)

Right now it throws TypeError: Expected primitive type, got: <class 'pyarrow.lib.SparseUnionType'>, which makes sense since what I am attempting isn't supported.

Should I be using a string type and casting in my queries?

jayceslesar avatar Aug 10 '24 19:08 jayceslesar

I think columns are generally strongly typed and won't allow a union type. See the PyIceberg types reference: https://py.iceberg.apache.org/reference/pyiceberg/types/

Here's the spec's description of data types; a data type must be either a primitive or a nested type: https://iceberg.apache.org/spec/#schemas-and-data-types

Among the primitive types, string and binary look like the ones that could serve as the on-disk representation. You would then need to cast in the query layer, as you described above.
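
As a rough illustration of the string-plus-cast approach, here is a minimal sketch that assumes the value is stored in two string columns: a type tag and the string-encoded value. The column split, the tag names, and the helpers encode_value and decode_value are hypothetical and not part of PyIceberg or PyArrow; in the schema both columns would simply be pa.string() fields.

```python
from typing import Tuple, Union

Value = Union[int, float, bool, str]

# Hypothetical type tags, stored in a separate string column next to the value.
_DECODERS = {
    "bool": lambda raw: raw == "true",
    "int": int,
    "float": float,
    "str": str,
}


def encode_value(value: Value) -> Tuple[str, str]:
    """Return (value_type, value_as_string) for storage in two string columns."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool", "true" if value else "false"
    if isinstance(value, int):
        return "int", str(value)
    if isinstance(value, float):
        # repr() keeps enough digits for the float to round-trip exactly.
        return "float", repr(value)
    if isinstance(value, str):
        return "str", value
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def decode_value(value_type: str, raw: str) -> Value:
    """Cast the stored string back to its original Python type."""
    return _DECODERS[value_type](raw)


if __name__ == "__main__":
    for original in (42, 3.5, True, "ok"):
        tag, raw = encode_value(original)
        print(tag, raw, decode_value(tag, raw) == original)
```

Keeping the type tag in its own column means a query engine can filter on it before casting, so a numeric cast is never attempted on rows that hold free-form strings.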

kevinjqliu avatar Aug 11 '24 17:08 kevinjqliu

thank you!

jayceslesar avatar Aug 15 '24 17:08 jayceslesar