Support native Pydantic schemas
Feature Request / Improvement
How difficult or feasible would it be to enable support for native Pydantic schemas that automatically map to the underlying pyiceberg types?
For example, I would like to define a schema as a regular pydantic model:
class Document(Schema):
    id: int
    name: str
    confidence: float = 0.0
and have it automatically converted to:
Schema(
    NestedField(
        field_id=1,
        field_type=IntegerType(),
        name="id",
        required=True,
    ),
    NestedField(
        field_id=2,
        field_type=StringType(),
        name="name",
        required=True,
    ),
    NestedField(
        field_id=3,
        field_type=FloatType(),
        initial_default=0.0,
        name="confidence",
        required=False,
    ),
)
New to both pyiceberg and pydantic, so I'm not sure if this is possible or where to start. Thanks!
I came up with a naive attempt, open to feedback. I'm not sure how it would handle e.g. int vs. long or float vs. double; one idea for that is sketched after the snippet below.
import builtins
import datetime
import uuid
from typing import (
    List,
    Literal,
)

from pydantic import BaseModel
from pydantic import Field
from pydantic_core import PydanticUndefined
from pyiceberg.schema import Schema as IcebergSchema
from pyiceberg.types import (
    BooleanType,
    DateType,
    FloatType,
    IntegerType,
    MapType,
    NestedField,
    PrimitiveType,
    StringType,
    TimestampType,
    UUIDType,
)


class UnknownType(PrimitiveType):
    """Remove after next public release."""

    root: Literal['unknown'] = Field(default='unknown')


class Schema(BaseModel):
    @classmethod
    def model_pyiceberg_schema(cls):
        """
        Generate a PyIceberg Schema for a model class.

        Returns:
            The pyiceberg Schema compatible with Apache Iceberg Tables.
        """
        pyiceberg_fields: List[NestedField] = []
        for index, (name, field) in enumerate(cls.model_fields.items()):
            default = (
                field.default if field.default != PydanticUndefined else None
            )
            match field.annotation:
                case builtins.bool:
                    field_type = BooleanType()
                case builtins.int:
                    field_type = IntegerType()
                case datetime.date:
                    field_type = DateType()
                case builtins.dict:
                    # NOTE: MapType needs key/value ids and types, so a bare
                    # dict annotation would need more handling here.
                    field_type = MapType()
                case builtins.float:
                    field_type = FloatType()
                case builtins.str:
                    field_type = StringType()
                case datetime.datetime:
                    field_type = TimestampType()
                case uuid.UUID:
                    field_type = UUIDType()
                case _:
                    field_type = UnknownType()
            pyiceberg_fields.append(
                NestedField(
                    field_id=index + 1,
                    field_type=field_type,
                    initial_default=default,  # not working, unsure why.
                    name=name,
                    required=field.is_required(),
                ),
            )
        return IcebergSchema(*pyiceberg_fields)
# define a schema using native pydantic class.
class Person(Schema):
    name: str
    age: int
    location: str = 'Atlanta, GA'
    dob: datetime.date


schema = Person.model_pyiceberg_schema()

# create table.
catalog.create_table('default.people', schema=schema.as_arrow())
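On the int-vs-long / float-vs-double question above, one hedged idea (just a sketch, not pyiceberg API) is to let callers opt into the wider type through typing.Annotated markers, which pydantic v2 carries on FieldInfo.metadata; the marker strings and helper below are made up:

from typing import Annotated

from pydantic import BaseModel
from pyiceberg.types import DoubleType, FloatType, IntegerType, LongType

# plain string markers so pydantic just carries them through as metadata
LONG = "iceberg-long"
DOUBLE = "iceberg-double"


class Measurement(BaseModel):
    count: Annotated[int, LONG]      # opt into LongType
    ratio: float                     # plain float stays FloatType
    total: Annotated[float, DOUBLE]  # opt into DoubleType


def numeric_iceberg_type(field):
    """Resolve an int/float field, honoring an Annotated marker."""
    # pydantic v2 keeps unrecognized Annotated metadata in FieldInfo.metadata.
    if LONG in field.metadata:
        return LongType()
    if DOUBLE in field.metadata:
        return DoubleType()
    return IntegerType() if field.annotation is int else FloatType()


for name, field in Measurement.model_fields.items():
    print(name, numeric_iceberg_type(field))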
You can look at https://github.com/simw/pydantic-to-pyarrow for inspiration
I think this is the future: being able to manipulate and validate objects individually, type-check them, but store them efficiently in tables without having to think about the schema.
@choucavalier agree, it seems like a natural enhancement; possibly it should be available directly through pyiceberg?
Type mapping is notoriously difficult. Since we already support bidirectional interchange between pyarrow schemas and iceberg schemas, I think it would be easier to map the pydantic schema to a pyarrow schema and then go from the pyarrow schema to the iceberg schema. wdyt?
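For what it's worth, a minimal sketch of that two-step route, assuming pydantic-to-pyarrow's get_pyarrow_schema and an already configured catalog (the model and table name here are made up):

import datetime

from pydantic import BaseModel
from pydantic_to_pyarrow import get_pyarrow_schema
from pyiceberg.catalog import load_catalog


class Person(BaseModel):
    name: str
    age: int
    dob: datetime.date


# pydantic -> pyarrow
pa_schema = get_pyarrow_schema(Person)

# pyarrow -> iceberg: create_table already accepts a pyarrow schema and
# converts it to an Iceberg schema internally.
catalog = load_catalog("default")
table = catalog.create_table("default.people", schema=pa_schema)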
Pydantic, TypedDict, dataclasses, JSON schema, etc.: there are many formats that could potentially generate iceberg schemas. It's not entirely obvious to me that pyiceberg should not implement APIs for working with these, though adding pydantic as a dependency should not be necessary to install pyiceberg in my opinion (maybe it could be an optional extra).
Working with pyarrow schemas is kinda low level; it increases the work required of users to define their iceberg schemas. Providing a higher-level API that automatically generates iceberg schemas from their existing model representations would definitely be useful.
I've been working a lot with those recently. Happy to help.
@choucavalier I started playing with your lib and it seems like generally what I was looking for; I ran into issues around UUID (opened issues there and in duckdb).
Locally I integrated it into my own base model; not sure if it would make sense to pull this into pyiceberg and expose it that way or not? e.g.
import uuid

from pydantic import BaseModel
from pydantic_to_pyarrow import get_pyarrow_schema


class IcebergBaseModel(BaseModel):
    def model_dump_arrow(self):
        """
        Serialize schema for arrow compatibility.

        Uses model_dump() and converts UUID values to bytes.
        """
        res = {}
        for k, v in self.model_dump().items():
            # convert uuid to bytes (pyarrow 19+).
            res[k] = v.bytes if isinstance(v, uuid.UUID) else v
        return res

    @classmethod
    def model_arrow_schema(cls):
        """
        Generate a pyarrow schema from the pydantic schema.

        Returns:
            The pyarrow schema.
        """
        return get_pyarrow_schema(cls, allow_losing_tz=True)
------
from pyiceberg import IcebergBaseModel
class MySchema(IcebergBaseModel):
    id: uuid.UUID
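A rough sketch of how that could be used end to end, assuming the IcebergBaseModel above; the load_catalog config, the table name, and the exact UUID handling (the rough edge mentioned earlier) are all placeholders:

import uuid

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # assumes a configured catalog

# create the table from the pyarrow schema generated off the pydantic model
arrow_schema = MySchema.model_arrow_schema()
table = catalog.create_table("default.my_table", schema=arrow_schema)

# validate a record with pydantic, then append it as a one-row arrow table;
# the UUID-to-bytes conversion in model_dump_arrow() is the part that still
# needs to line up with whatever type pydantic-to-pyarrow emits for UUID.
record = MySchema(id=uuid.uuid4())
table.append(pa.Table.from_pylist([record.model_dump_arrow()], schema=arrow_schema))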
Can we just leverage that pydantic-to-pyarrow lib and apply the conversion here for the pydantic case: https://github.com/apache/iceberg-python/blob/main/pyiceberg/catalog/__init__.py#L742-L758
@jim-ngoo so from a DX perspective we should just pass in a pydantic schema, e.g. catalog.create_table('mytable', schema=MyPydanticModel)? That would be nice.
I think so, though I am not sure if we can just include the pydantic-to-pyarrow lib inside pyiceberg.
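To illustrate the DX in question, a user-side sketch (create_table_from_model is made up, not pyiceberg API) that hides the pydantic -> pyarrow hop:

from typing import Type

from pydantic import BaseModel
from pydantic_to_pyarrow import get_pyarrow_schema
from pyiceberg.catalog import Catalog
from pyiceberg.table import Table


def create_table_from_model(catalog: Catalog, identifier: str, model: Type[BaseModel]) -> Table:
    """Create an Iceberg table whose schema is derived from a pydantic model."""
    # get_pyarrow_schema does pydantic -> pyarrow; create_table then converts
    # the pyarrow schema to an Iceberg schema internally.
    return catalog.create_table(identifier, schema=get_pyarrow_schema(model))


# desired usage: create_table_from_model(catalog, 'mytable', MyPydanticModel)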
Hi! I wrote (most of) the pydantic-to-pyarrow library. Happy to help with integration if needed. The existing code is so-so, partly from stumbling across the different ways different Python versions treat different types.
As noted above, type mapping has a bunch of corner cases; even with the pyarrow implementation, I'm seeing odd behavior with pa.uuid() (I'll raise it in a separate issue). I'm sure the pyiceberg user base would bump into a lot more of these than the small number of users of pydantic-to-pyarrow. But pydantic is now so widely used, maybe it's worth hammering out those issues?
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.