Support native Pydantic schemas
Feature Request / Improvement
How difficult or feasible would it be to enable support for native Pydantic schemas that automatically map to the underlying pyiceberg types?
For example, I would like to define a schema as a regular pydantic model:
class Document(Schema):
    id: int
    name: str
    confidence: float = 0.0
and have it automatically converted to:
Schema(
    NestedField(
        field_id=1,
        field_type=IntegerType(),
        name="id",
        required=True,
    ),
    NestedField(
        field_id=2,
        field_type=StringType(),
        name="name",
        required=True,
    ),
    NestedField(
        field_id=3,
        field_type=FloatType(),
        initial_default=0.0,
        name="confidence",
        required=False,
    ),
)
New to both pyiceberg and pydantic, so I'm not sure if this is possible or where to start. Thanks!
I came up with a naive attempt, open to feedback. I'm not sure how it would handle e.g. int vs. long or float vs. double; one idea for that is sketched after the snippet below.
import builtins
import datetime
import uuid
from typing import (
    List,
    Literal,
)

from pydantic import BaseModel
from pydantic import Field
from pydantic_core import PydanticUndefined
from pyiceberg.schema import Schema as IcebergSchema
from pyiceberg.types import (
    BooleanType,
    DateType,
    FloatType,
    IntegerType,
    MapType,
    NestedField,
    PrimitiveType,
    StringType,
    TimestampType,
    UUIDType,
)


class UnknownType(PrimitiveType):
    """Remove after next public release."""

    root: Literal['unknown'] = Field(default='unknown')


class Schema(BaseModel):
    @classmethod
    def model_pyiceberg_schema(cls):
        """
        Generate a PyIceberg Schema for a model class.

        Returns:
            The pyiceberg Schema compatible with Apache Iceberg Tables.
        """
        pyiceberg_fields: List[NestedField] = []
        for index, (name, field) in enumerate(cls.model_fields.items()):
            default = (
                field.default if field.default != PydanticUndefined else None
            )
            match field.annotation:
                case builtins.bool:
                    field_type = BooleanType()
                case builtins.int:
                    field_type = IntegerType()
                case datetime.date:
                    field_type = DateType()
                case builtins.dict:
                    # NOTE: MapType needs key/value ids and types, so a bare
                    # dict annotation would need more handling here.
                    field_type = MapType()
                case builtins.float:
                    field_type = FloatType()
                case builtins.str:
                    field_type = StringType()
                case datetime.datetime:
                    field_type = TimestampType()
                case uuid.UUID:
                    field_type = UUIDType()
                case _:
                    field_type = UnknownType()
            pyiceberg_fields.append(
                NestedField(
                    field_id=index + 1,
                    field_type=field_type,
                    initial_default=default,  # not working, unsure why.
                    name=name,
                    required=field.is_required(),
                ),
            )
        return IcebergSchema(*pyiceberg_fields)
# define a schema using native pydantic class.
class Person(Schema):
    name: str
    age: int
    location: str = 'Atlanta, GA'
    dob: datetime.date


schema = Person.model_pyiceberg_schema()

# create table.
catalog.create_table('default.people', schema=schema.as_arrow())
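On the int-vs-long / float-vs-double question above, one hedged idea (just a sketch, not pyiceberg API) is to let callers opt into the wider type through typing.Annotated markers, which pydantic v2 carries on FieldInfo.metadata; the marker strings and helper below are made up:

from typing import Annotated

from pydantic import BaseModel
from pyiceberg.types import DoubleType, FloatType, IntegerType, LongType

# plain string markers so pydantic just carries them through as metadata
LONG = "iceberg-long"
DOUBLE = "iceberg-double"


class Measurement(BaseModel):
    count: Annotated[int, LONG]      # opt into LongType
    ratio: float                     # plain float stays FloatType
    total: Annotated[float, DOUBLE]  # opt into DoubleType


def numeric_iceberg_type(field):
    """Resolve an int/float field, honoring an Annotated marker."""
    # pydantic v2 keeps unrecognized Annotated metadata in FieldInfo.metadata.
    if LONG in field.metadata:
        return LongType()
    if DOUBLE in field.metadata:
        return DoubleType()
    return IntegerType() if field.annotation is int else FloatType()


for name, field in Measurement.model_fields.items():
    print(name, numeric_iceberg_type(field))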
You can look at https://github.com/simw/pydantic-to-pyarrow for inspiration
I think this is the future: being able to manipulate and validate objects individually, type-check them, but store them efficiently in tables without having to think about the schema.
@choucavalier agree, it seems like a natural enhancement; possibly it should be available directly through pyiceberg?
Type mapping is notoriously difficult. Since we already support bidirectional interchange between pyarrow schemas and iceberg schemas, I think it would be easier to map the pydantic schema to a pyarrow schema and then go from the pyarrow schema to the iceberg schema. wdyt?
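For what it's worth, a minimal sketch of that two-step route, assuming pydantic-to-pyarrow's get_pyarrow_schema and an already configured catalog (the model and table name here are made up):

import datetime

from pydantic import BaseModel
from pydantic_to_pyarrow import get_pyarrow_schema
from pyiceberg.catalog import load_catalog


class Person(BaseModel):
    name: str
    age: int
    dob: datetime.date


# pydantic -> pyarrow
pa_schema = get_pyarrow_schema(Person)

# pyarrow -> iceberg: create_table already accepts a pyarrow schema and
# converts it to an Iceberg schema internally.
catalog = load_catalog("default")
table = catalog.create_table("default.people", schema=pa_schema)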
Pydantic, TypedDict, dataclasses, JSON schema, etc.: there are many formats that could potentially generate iceberg schemas. It's not entirely obvious to me that pyiceberg should not implement APIs for working with these, though adding pydantic as a dependency should not be necessary to install pyiceberg in my opinion (maybe it could be an optional extra).
Working with pyarrow schemas is kinda low level; it increases the work required of users to define their iceberg schemas. Providing a higher-level API that automatically generates iceberg schemas from their existing model representations would definitely be useful.
I've been working a lot with those recently. Happy to help.
@choucavalier I started playing with your lib and it seems like generally what I was looking for; I ran into issues around UUID (opened issues there and in duckdb).
Locally I integrated it into my own base model; not sure if it would make sense to pull this into pyiceberg and expose it that way or not? e.g.
import uuid

from pydantic import BaseModel
from pydantic_to_pyarrow import get_pyarrow_schema


class IcebergBaseModel(BaseModel):
    def model_dump_arrow(self):
        """
        Serialize schema for arrow compatibility.

        Uses model_dump() and converts UUID values to bytes.
        """
        res = {}
        for k, v in self.model_dump().items():
            # convert uuid to bytes (pyarrow 19+).
            res[k] = v.bytes if isinstance(v, uuid.UUID) else v
        return res

    @classmethod
    def model_arrow_schema(cls):
        """
        Generate a pyarrow schema from the pydantic schema.

        Returns:
            The pyarrow schema.
        """
        return get_pyarrow_schema(cls, allow_losing_tz=True)
------
from pyiceberg import IcebergBaseModel
class MySchema(IcebergBaseModel):
    id: uuid.UUID
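A rough sketch of how that could be used end to end, assuming the IcebergBaseModel above; the load_catalog config, the table name, and the exact UUID handling (the rough edge mentioned earlier) are all placeholders:

import uuid

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # assumes a configured catalog

# create the table from the pyarrow schema generated off the pydantic model
arrow_schema = MySchema.model_arrow_schema()
table = catalog.create_table("default.my_table", schema=arrow_schema)

# validate a record with pydantic, then append it as a one-row arrow table;
# the UUID-to-bytes conversion in model_dump_arrow() is the part that still
# needs to line up with whatever type pydantic-to-pyarrow emits for UUID.
record = MySchema(id=uuid.uuid4())
table.append(pa.Table.from_pylist([record.model_dump_arrow()], schema=arrow_schema))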
Can we just leverage that pydantic-to-pyarrow lib and apply the conversion here for the pydantic case: https://github.com/apache/iceberg-python/blob/main/pyiceberg/catalog/__init__.py#L742-L758
@jim-ngoo so from a DX perspective we should just pass in a pydantic schema, e.g. catalog.create_table('mytable', schema=MyPydanticModel)? That would be nice.
I think so, though I am not sure if we can just include the pydantic-to-pyarrow lib inside pyiceberg.
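To illustrate the DX in question, a user-side sketch (create_table_from_model is made up, not pyiceberg API) that hides the pydantic -> pyarrow hop:

from typing import Type

from pydantic import BaseModel
from pydantic_to_pyarrow import get_pyarrow_schema
from pyiceberg.catalog import Catalog
from pyiceberg.table import Table


def create_table_from_model(catalog: Catalog, identifier: str, model: Type[BaseModel]) -> Table:
    """Create an Iceberg table whose schema is derived from a pydantic model."""
    # get_pyarrow_schema does pydantic -> pyarrow; create_table then converts
    # the pyarrow schema to an Iceberg schema internally.
    return catalog.create_table(identifier, schema=get_pyarrow_schema(model))


# desired usage: create_table_from_model(catalog, 'mytable', MyPydanticModel)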
Hi! I wrote (most of) the pydantic-to-pyarrow library. Happy to help with integration if needed. The existing code is so-so, partly from stumbling across the different ways different Python versions treat different types.
As noted above, type mapping has a bunch of corner cases; even with the pyarrow implementation, I'm seeing odd behavior with pa.uuid() (I'll raise it in a separate issue). I'm sure the pyiceberg user base would bump into a lot more of these than the small number of users of pydantic-to-pyarrow. But pydantic is now so widely used, maybe it's worth hammering out those issues?
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.